[jira] [Created] (SPARK-18433) Improve `DataSource.scala` to be more case-insensitive

2016-11-13 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-18433:
-

 Summary: Improve `DataSource.scala` to be more case-insensitive
 Key: SPARK-18433
 URL: https://issues.apache.org/jira/browse/SPARK-18433
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Dongjoon Hyun
Priority: Minor


Currently, `DataSource.scala` uses `CaseInsensitiveMap` in only some of its code paths.

This issue aims to make `DataSource` use `CaseInsensitiveMap` consistently, except when 
passing options to other modules (`InMemoryFileIndex` and 
`InsertIntoHadoopFsRelationCommand`), which internally create new case-sensitive 
`HadoopConf`s by calling `newHadoopConfWithOptions(options)`.
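
For context, a minimal sketch of what case-insensitive option handling means here. This is only an illustration; the class name and API below are made up for the example and are not the actual org.apache.spark.sql.catalyst.util.CaseInsensitiveMap:

{code}
// Minimal sketch of case-insensitive option lookup; class name and API are illustrative.
class CaseInsensitiveOptions(options: Map[String, String]) {
  private val lowerCased = options.map { case (k, v) => (k.toLowerCase, v) }
  def get(key: String): Option[String] = lowerCased.get(key.toLowerCase)
}

val opts = new CaseInsensitiveOptions(Map("Path" -> "/tmp/data", "inferSchema" -> "true"))
assert(opts.get("path") == Some("/tmp/data"))     // key case is ignored on lookup
assert(opts.get("INFERSCHEMA") == Some("true"))
{code}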






[jira] [Created] (SPARK-18432) Fix HDFS block size in programming guide

2016-11-13 Thread Noritaka Sekiyama (JIRA)
Noritaka Sekiyama created SPARK-18432:
-

 Summary: Fix HDFS block size in programming guide
 Key: SPARK-18432
 URL: https://issues.apache.org/jira/browse/SPARK-18432
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Noritaka Sekiyama
Priority: Minor


http://spark.apache.org/docs/latest/programming-guide.html
"By default, Spark creates one partition for each block of the file (blocks 
being 64MB by default in HDFS)"

The current default block size in HDFS is 128MB.
The default was already increased to 128MB in Hadoop 2.2.0 (the oldest Hadoop version 
supported by Spark). https://issues.apache.org/jira/browse/HDFS-4053

Since the current explanation is confusing, I'd like to change the value in the guide from 
64MB to 128MB.
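
As a back-of-the-envelope illustration of why the number matters (assuming a spark-shell with `sc` in scope; the path and the 1 GB file size are hypothetical):

{code}
// A ~1 GB file stored on HDFS with the current 128 MB default occupies 8 blocks,
// so Spark creates about 8 partitions; with the old 64 MB figure it would be ~16.
val rdd = sc.textFile("hdfs:///data/one-gigabyte-file.txt")  // illustrative path
println(rdd.getNumPartitions)
{code}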






[jira] [Created] (SPARK-18431) Hard coded value in org.apache.spark.streaming.kinesis.KinesisReceiver

2016-11-13 Thread Shushant Arora (JIRA)
Shushant Arora created SPARK-18431:
--

 Summary: Hard coded value in 
org.apache.spark.streaming.kinesis.KinesisReceiver
 Key: SPARK-18431
 URL: https://issues.apache.org/jira/browse/SPARK-18431
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.1
Reporter: Shushant Arora


There is a hardcoded value of taskBackoffTimeMillis (500) in the onStart method of 
org.apache.spark.streaming.kinesis.KinesisReceiver. Instead of being hardcoded, the value 
should be configurable.
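
A minimal sketch of what a configurable backoff could look like; the configuration key below is hypothetical, not an existing Spark setting:

{code}
import org.apache.spark.SparkConf

// Hypothetical key; today the 500 ms backoff is hardcoded in KinesisReceiver.onStart.
val conf = new SparkConf()
val taskBackoffTimeMillis =
  conf.getLong("spark.streaming.kinesis.taskBackoffTimeMillis", 500L)  // keep 500 ms as the default
{code}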






[jira] [Commented] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662955#comment-15662955
 ] 

Apache Spark commented on SPARK-18430:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15878

> Returned Message Null when Hitting an Invocation Exception of Function Lookup.
> --
>
> Key: SPARK-18430
> URL: https://issues.apache.org/jira/browse/SPARK-18430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When the exception is an invocation exception during function lookup, we 
> return a useless/confusing error message:
> For example, 
> {code}
>   df.selectExpr("format_string()")
> {code}
> or 
> {code}
>   df.selectExpr("concat_ws()")
> {code}
> Below is the error message we got:
> {code}
> null; line 1 pos 0
> org.apache.spark.sql.AnalysisException: null; line 1 pos 0
> {code}






[jira] [Assigned] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18430:


Assignee: Xiao Li  (was: Apache Spark)

> Returned Message Null when Hitting an Invocation Exception of Function Lookup.
> --
>
> Key: SPARK-18430
> URL: https://issues.apache.org/jira/browse/SPARK-18430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When the exception is an invocation exception during function lookup, we 
> return a useless/confusing error message:
> For example, 
> {code}
>   df.selectExpr("format_string()")
> {code}
> or 
> {code}
>   df.selectExpr("concat_ws()")
> {code}
> Below is the error message we got:
> {code}
> null; line 1 pos 0
> org.apache.spark.sql.AnalysisException: null; line 1 pos 0
> {code}






[jira] [Assigned] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18430:


Assignee: Apache Spark  (was: Xiao Li)

> Returned Message Null when Hitting an Invocation Exception of Function Lookup.
> --
>
> Key: SPARK-18430
> URL: https://issues.apache.org/jira/browse/SPARK-18430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When the exception is an invocation exception during function lookup, we 
> return a useless/confusing error message:
> For example, 
> {code}
>   df.selectExpr("format_string()")
> {code}
> or 
> {code}
>   df.selectExpr("concat_ws()")
> {code}
> Below is the error message we got:
> {code}
> null; line 1 pos 0
> org.apache.spark.sql.AnalysisException: null; line 1 pos 0
> {code}






[jira] [Updated] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.

2016-11-13 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18430:

Description: 
When the exception is an invocation exception during function lookup, we return 
a useless/confusing error message:

For example, 
{code}
  df.selectExpr("format_string()")
{code}
or 
{code}
  df.selectExpr("concat_ws()")
{code}

Below is the error message we got:
{code}
null; line 1 pos 0
org.apache.spark.sql.AnalysisException: null; line 1 pos 0
{code}

  was:
When the exception is an invocation exception during function lookup, we return 
a useless/confusing error message:

For example, 
{code}
  df.selectExpr("format_string()")
{code}
or 
{code}
  df.selectExpr("concat_ws()")
{code}

{code}
null; line 1 pos 0
org.apache.spark.sql.AnalysisException: null; line 1 pos 0
{code}


> Returned Message Null when Hitting an Invocation Exception of Function Lookup.
> --
>
> Key: SPARK-18430
> URL: https://issues.apache.org/jira/browse/SPARK-18430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When the exception is an invocation exception during function lookup, we 
> return a useless/confusing error message:
> For example, 
> {code}
>   df.selectExpr("format_string()")
> {code}
> or 
> {code}
>   df.selectExpr("concat_ws()")
> {code}
> Below is the error message we got:
> {code}
> null; line 1 pos 0
> org.apache.spark.sql.AnalysisException: null; line 1 pos 0
> {code}






[jira] [Created] (SPARK-18430) Returned Message Null when Hitting an Invocation Exception of Function Lookup.

2016-11-13 Thread Xiao Li (JIRA)
Xiao Li created SPARK-18430:
---

 Summary: Returned Message Null when Hitting an Invocation 
Exception of Function Lookup.
 Key: SPARK-18430
 URL: https://issues.apache.org/jira/browse/SPARK-18430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


When the exception is an invocation exception during function lookup, we return 
a useless/confusing error message:

For example, 
{code}
  df.selectExpr("format_string()")
{code}
or 
{code}
  df.selectExpr("concat_ws()")
{code}

{code}
null; line 1 pos 0
org.apache.spark.sql.AnalysisException: null; line 1 pos 0
{code}
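
The "null" text is typical of reporting the message of a reflective InvocationTargetException rather than its underlying cause. A minimal sketch of the idea (not the actual FunctionRegistry code) of unwrapping the cause to surface a useful message:

{code}
import java.lang.reflect.InvocationTargetException

// Sketch: an InvocationTargetException usually carries a null message,
// so report the wrapped cause instead of the wrapper itself.
def describe(e: Throwable): String = e match {
  case ite: InvocationTargetException if ite.getCause != null => ite.getCause.getMessage
  case other => Option(other.getMessage).getOrElse(other.getClass.getName)
}
{code}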






[jira] [Assigned] (SPARK-18429) implement a new Aggregate for CountMinSketch

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18429:


Assignee: (was: Apache Spark)

> implement a new Aggregate for CountMinSketch
> 
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Implement a new Aggregate to generate count min sketch, which is a wrapper of 
> CountMinSketch.






[jira] [Commented] (SPARK-18429) implement a new Aggregate for CountMinSketch

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662906#comment-15662906
 ] 

Apache Spark commented on SPARK-18429:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15877

> implement a new Aggregate for CountMinSketch
> 
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Zhenhua Wang
>
> Implement a new Aggregate to generate count min sketch, which is a wrapper of 
> CountMinSketch.






[jira] [Assigned] (SPARK-18429) implement a new Aggregate for CountMinSketch

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18429:


Assignee: Apache Spark

> implement a new Aggregate for CountMinSketch
> 
>
> Key: SPARK-18429
> URL: https://issues.apache.org/jira/browse/SPARK-18429
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Implement a new Aggregate to generate count min sketch, which is a wrapper of 
> CountMinSketch.






[jira] [Created] (SPARK-18429) implement a new Aggregate for CountMinSketch

2016-11-13 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-18429:


 Summary: implement a new Aggregate for CountMinSketch
 Key: SPARK-18429
 URL: https://issues.apache.org/jira/browse/SPARK-18429
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Zhenhua Wang


Implement a new Aggregate to generate a count-min sketch, which is a wrapper of 
CountMinSketch.
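
For reference, the sketch data structure itself already exists as org.apache.spark.util.sketch.CountMinSketch; a small driver-side example of the kind of summary the proposed aggregate would compute (the eps/confidence/seed values are illustrative):

{code}
import org.apache.spark.util.sketch.CountMinSketch

// Build a count-min sketch over a handful of values and query an approximate count.
val cms = CountMinSketch.create(0.01, 0.99, 42)   // eps, confidence, seed
Seq("a", "b", "a", "c", "a").foreach(v => cms.add(v))
println(cms.estimateCount("a"))                   // approximately 3
{code}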






[jira] [Commented] (SPARK-11496) Parallel implementation of personalized pagerank

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662890#comment-15662890
 ] 

Apache Spark commented on SPARK-11496:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15876

> Parallel implementation of personalized pagerank
> 
>
> Key: SPARK-11496
> URL: https://issues.apache.org/jira/browse/SPARK-11496
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Affects Versions: 2.1.0
>Reporter: Yves Raimond
>Assignee: Yves Raimond
>Priority: Minor
> Fix For: 2.1.0
>
>
> The current implementation of personalized pagerank only supports one source 
> node. Most applications of personalized pagerank require to run the 
> propagation for multiple source nodes. However code such as:
> {code}
> sourceVertices.map { sourceVertex => 
> graph.staticPersonalizedPageRank(sourceVertex, 10) }
> {code}
> Will be very slow, as it needs to run 10 iterations * sourceVertices.size 
> propagation steps.
> It would be good to offer an alternative API that runs personalized pagerank 
> over a list of source vertices in parallel, so that it only needs to run 10 
> propagation steps in the example above.
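
A rough cost comparison under the description's reasoning (the 100-source figure below is just an illustration):

{code}
// Looping per source runs a full PageRank each time, while a parallel variant
// would propagate a vector of ranks for all sources in one pass.
val numIterations = 10
val numSources = 100                                  // illustrative
val sequentialSteps = numIterations * numSources      // 1000 propagation steps today
val parallelSteps = numIterations                     // 10 steps with the proposed API
{code}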






[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans

2016-11-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18427:
-
Component/s: MLlib

> Update docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now
> disabled.
> 2. Add a note about {{k}}: fewer than {{k}} clusters may be returned,
> according to comments in the sources.






[jira] [Assigned] (SPARK-18428) Update docs for Graph.op

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18428:


Assignee: (was: Apache Spark)

> Update docs for Graph.op
> 
>
> Key: SPARK-18428
> URL: https://issues.apache.org/jira/browse/SPARK-18428
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, GraphX
>Reporter: zhengruifeng
>Priority: Minor
>
> Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing 
> APIs.






[jira] [Commented] (SPARK-18428) Update docs for Graph.op

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662856#comment-15662856
 ] 

Apache Spark commented on SPARK-18428:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15875

> Update docs for Graph.op
> 
>
> Key: SPARK-18428
> URL: https://issues.apache.org/jira/browse/SPARK-18428
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, GraphX
>Reporter: zhengruifeng
>Priority: Minor
>
> Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing 
> APIs.






[jira] [Assigned] (SPARK-18428) Update docs for Graph.op

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18428:


Assignee: Apache Spark

> Update docs for Graph.op
> 
>
> Key: SPARK-18428
> URL: https://issues.apache.org/jira/browse/SPARK-18428
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, GraphX
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing 
> APIs.






[jira] [Created] (SPARK-18428) Update docs for Graph.op

2016-11-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18428:


 Summary: Update docs for Graph.op
 Key: SPARK-18428
 URL: https://issues.apache.org/jira/browse/SPARK-18428
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, GraphX
Reporter: zhengruifeng
Priority: Minor


Update {{Summary List of Operators}} and {{VertexRDDs}} to include missing APIs.






[jira] [Commented] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-13 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662832#comment-15662832
 ] 

Hyukjin Kwon commented on SPARK-18420:
--

It seems

{code}
[ERROR] src/main/java/org/apache/spark/io/NioBufferedFileInputStream.java:[133] 
(coding) NoFinalizer: Avoid using finalizer method.
{code}

was missed here.

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>
> Small fix: fix the compile errors caused by checkstyle.
> Before:
> ```
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] 
> (imports) UnusedImports: Unused import - 
> org.apache.commons.crypto.cipher.CryptoCipherFactory.
> [ERROR] 
> src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] 
> (modifier) RedundantModifier: Redundant 'public' modifier.
> [ERROR] 
> src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71]
>  (sizes) LineLength: Line is longer than 100 characters (found 113).
> [ERROR] 
> src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112]
>  (sizes) LineLength: Line is longer than 100 characters (found 110).
> src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64]
>  (sizes) LineLength: Line is longer than 100 characters (found 103).
> [ERROR] 
> src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] 
> (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
> [ERROR] 
> src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] 
> (regexp) RegexpSingleline: No trailing whitespace allowed.
> ```
> After:
> `mvn install`
> `lint-java`
> Checkstyle checks passed






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662784#comment-15662784
 ] 

Dongjoon Hyun commented on SPARK-18413:
---

Yep. Thank you for the review. I left a comment about that. After collecting more 
opinions, I'll update the PR accordingly.

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> And I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold so many
> connections and returns an exception.
> In the above situation it is 200 because of the "group by" and
> "spark.sql.shuffle.partitions".
> The Spark source code in JdbcUtils is:
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> Maybe we can add a property so that we call df.repartition(num).foreachPartition?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-13 Thread lichenglin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662759#comment-15662759
 ] 

lichenglin commented on SPARK-18413:


I'm sorry, my network is too poor to download the dependencies from the Maven repository 
for building Spark.
I have made a comment on your PR; please check whether it is right.
Thanks

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> And I found that Spark will create a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold so many
> connections and returns an exception.
> In the above situation it is 200 because of the "group by" and
> "spark.sql.shuffle.partitions".
> The Spark source code in JdbcUtils is:
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> Maybe we can add a property so that we call df.repartition(num).foreachPartition?
> In fact I got an exception "ORA-12519, TNS:no appropriate service handler 
> found"






[jira] [Assigned] (SPARK-18408) API Improvements for LSH

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18408:


Assignee: Apache Spark

> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>Assignee: Apache Spark
>
> As the first improvements to the current LSH implementations, we are planning to
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private
>  - Make Multi-Probe NN Search and {{hashDistance}} private for future 
> discussion
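
A small sketch of the proposed output layout, i.e. one hash Vector per hash table. The variable names below just mirror the proposal; this is an illustration, not an existing API:

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Output becomes an Array of numHashTables Vectors, each of length numHashFunctions.
val numHashTables = 3
val numHashFunctions = 2
val hashes: Array[Vector] =
  Array.fill(numHashTables)(Vectors.dense(Array.fill(numHashFunctions)(0.0)))
assert(hashes.length == numHashTables && hashes.head.size == numHashFunctions)
{code}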






[jira] [Assigned] (SPARK-18408) API Improvements for LSH

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18408:


Assignee: (was: Apache Spark)

> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> As the first improvements to the current LSH implementations, we are planning to
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private
>  - Make Multi-Probe NN Search and {{hashDistance}} private for future 
> discussion






[jira] [Commented] (SPARK-18408) API Improvements for LSH

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662754#comment-15662754
 ] 

Apache Spark commented on SPARK-18408:
--

User 'Yunni' has created a pull request for this issue:
https://github.com/apache/spark/pull/15874

> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> As the first improvements to the current LSH implementations, we are planning to
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private
>  - Make Multi-Probe NN Search and {{hashDistance}} private for future 
> discussion






[jira] [Resolved] (SPARK-18412) SparkR spark.randomForest classification throws exception when training on libsvm data

2016-11-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18412.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> SparkR spark.randomForest classification throws exception when training on 
> libsvm data
> --
>
> Key: SPARK-18412
> URL: https://issues.apache.org/jira/browse/SPARK-18412
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.1.0
>
>
> {{spark.randomForest}} classification throws exception when training on 
> libsvm data. It can be reproduced as following:
> {code}
> df <- read.df("data/mllib/sample_multiclass_classification_data.txt", source 
> = "libsvm")
> model <- spark.randomForest(df, label ~ features, "classification")
> {code}
> The exception is:
> {code}
> Error in handleErrors(returnStatus, conn) :
>   java.lang.IllegalArgumentException: requirement failed: If label column 
> already exists, forceIndexLabel can not be set with true.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.ml.feature.RFormula.transformSchema(RFormula.scala:205)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>   at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:136)
>   at 
> org.apache.spark.ml.r.RandomForestClassifierWrapper$.fit(RandomForestClassificationWrapper.scala:86)
>   at 
> org.apache.spark.ml.r.RandomForestClassifierWrapper.fit(RandomForestClassificationWrapper.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:172)
> {code}
> This error occurs because the label column of the R formula already exists, so we
> cannot force it to index the label. However, the label must be indexed for
> classification algorithms, so we need to rename RFormula.labelCol to a
> new value and then index the original label.
> This issue also appears in other algorithms: spark.naiveBayes, spark.glm (only
> for the binomial family) and spark.gbt (only for classification).






[jira] [Updated] (SPARK-18408) API Improvements for LSH

2016-11-13 Thread Yun Ni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Ni updated SPARK-18408:
---
Description: 
As the first improvements to the current LSH implementations, we are planning to do 
the following:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private
 - Make Multi-Probe NN Search and {{hashDistance}} private for future discussion

  was:
As the first improvements to current LSH Implementations, we are planning to do 
the followings:
 - Change output schema to {{Array of Vector}} instead of {{Vectors}}
 - Use {{numHashTables}} as the dimension of {{Array}} and {{numHashFunctions}} 
as the dimension of {{Vector}}
 - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, {{MinHash}} 
to {{MinHashLSH}}
 - Make randUnitVectors/randCoefficients private


> API Improvements for LSH
> 
>
> Key: SPARK-18408
> URL: https://issues.apache.org/jira/browse/SPARK-18408
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Ni
>
> As the first improvements to the current LSH implementations, we are planning to
> do the following:
>  - Change output schema to {{Array of Vector}} instead of {{Vectors}}
>  - Use {{numHashTables}} as the dimension of {{Array}} and 
> {{numHashFunctions}} as the dimension of {{Vector}}
>  - Rename {{RandomProjection}} to {{BucketedRandomProjectionLSH}}, 
> {{MinHash}} to {{MinHashLSH}}
>  - Make randUnitVectors/randCoefficients private
>  - Make Multi-Probe NN Search and {{hashDistance}} private for future 
> discussion






[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans

2016-11-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18427:
-
Description: 
1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now disabled.
2. Add a note about {{k}}: fewer than {{k}} clusters may be returned, 
according to comments in the sources.

  was:
1,Remove {{runs}} from docs of {{mllib.KMeans}}, Since {{runs}} is now disabled
2,Add notes for {{k}} that according to comments in sources


> Update docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Minor
>
> 1. Remove {{runs}} from the docs of {{mllib.KMeans}}, since {{runs}} is now
> disabled.
> 2. Add a note about {{k}}: fewer than {{k}} clusters may be returned,
> according to comments in the sources.






[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans

2016-11-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18427:
-
Description: 
1,Remove {{runs}} from docs of {{mllib.KMeans}}, Since {{runs}} is now disabled
2,Add notes for {{k}} that according to comments in sources

  was:Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs 
should be also updated.


> Update docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Minor
>
> 1,Remove {{runs}} from docs of {{mllib.KMeans}}, Since {{runs}} is now 
> disabled
> 2,Add notes for {{k}} that according to comments in sources






[jira] [Updated] (SPARK-18427) Update docs of mllib.KMeans

2016-11-13 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-18427:
-
Summary: Update docs of mllib.KMeans  (was: Remove 'runs' from docs of mllib.KMeans)

> Update docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Minor
>
> Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should 
> be also updated.






[jira] [Assigned] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18427:


Assignee: Apache Spark

> Remove 'runs' from docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should 
> be also updated.






[jira] [Commented] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662634#comment-15662634
 ] 

Apache Spark commented on SPARK-18427:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15873

> Remove 'runs' from docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Minor
>
> Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should 
> be also updated.






[jira] [Assigned] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18427:


Assignee: (was: Apache Spark)

> Remove 'runs' from docs of mllib.KMeans  
> -
>
> Key: SPARK-18427
> URL: https://issues.apache.org/jira/browse/SPARK-18427
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Minor
>
> Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should 
> be also updated.






[jira] [Created] (SPARK-18427) Remove 'runs' from docs of mllib.KMeans

2016-11-13 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18427:


 Summary: Remove 'runs' from docs of mllib.KMeans  
 Key: SPARK-18427
 URL: https://issues.apache.org/jira/browse/SPARK-18427
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: zhengruifeng
Priority: Minor


Since {{runs}} is disabled in {{mllib.KMeans}}, the corresponding docs should also 
be updated.






[jira] [Updated] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18426:

Fix Version/s: 2.0.3

> Python Documentation Fix for Structured Streaming Programming Guide
> ---
>
> Key: SPARK-18426
> URL: https://issues.apache.org/jira/browse/SPARK-18426
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Denny Lee
>Assignee: Denny Lee
>Priority: Minor
>  Labels: documentation
> Fix For: 2.0.3, 2.1.0
>
>
> When running the Python example in the Structured Streaming Guide, you get the error:
> spark = SparkSession\
> TypeError: 'Builder' object is not callable
> This is fixed by changing .builder() to .builder.






[jira] [Resolved] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18426.
-
   Resolution: Fixed
 Assignee: Denny Lee
Fix Version/s: (was: 2.0.2)
   2.1.0
   2.0.3

> Python Documentation Fix for Structured Streaming Programming Guide
> ---
>
> Key: SPARK-18426
> URL: https://issues.apache.org/jira/browse/SPARK-18426
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Denny Lee
>Assignee: Denny Lee
>Priority: Minor
>  Labels: documentation
> Fix For: 2.0.3, 2.1.0
>
>
> When running the Python example in the Structured Streaming Guide, you get the error:
> spark = SparkSession\
> TypeError: 'Builder' object is not callable
> This is fixed by changing .builder() to .builder.






[jira] [Updated] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18426:

Fix Version/s: (was: 2.0.3)

> Python Documentation Fix for Structured Streaming Programming Guide
> ---
>
> Key: SPARK-18426
> URL: https://issues.apache.org/jira/browse/SPARK-18426
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Denny Lee
>Assignee: Denny Lee
>Priority: Minor
>  Labels: documentation
> Fix For: 2.1.0
>
>
> When running the Python example in the Structured Streaming Guide, you get the error:
> spark = SparkSession\
> TypeError: 'Builder' object is not callable
> This is fixed by changing .builder() to .builder.






[jira] [Comment Edited] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms

2016-11-13 Thread Aditya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662142#comment-15662142
 ] 

Aditya edited comment on SPARK-17836 at 11/13/16 9:39 PM:
--

I want to work on this issue. Is it fine?


was (Author: aditya1702):
[-Sean Owen] I want to work on this issue. Is it fine?

> Use cross validation to determine the number of clusters for EM or KMeans 
> algorithms
> 
>
> Key: SPARK-17836
> URL: https://issues.apache.org/jira/browse/SPARK-17836
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>Priority: Minor
>
> Sometimes it's not easy for users to determine the number of clusters.
> It would be very useful if Spark ML could support this.
> There are several methods to do this according to the wiki:
> https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
> Weka uses cross validation.
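
Until something like this lands, the usual manual approach is to sweep k and look for the "elbow" in the within-set cost. A rough sketch against the existing ml.clustering API; the "features" column convention and the caller-provided DataFrame are assumptions:

{code}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Fit KMeans for several candidate k values and report the cost for each,
// so a user can pick the point where the cost stops dropping sharply.
def costPerK(df: DataFrame, ks: Seq[Int]): Seq[(Int, Double)] =
  ks.map { k =>
    val model = new KMeans().setK(k).setSeed(1L).fit(df)
    (k, model.computeCost(df))
  }
{code}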






[jira] [Comment Edited] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms

2016-11-13 Thread Aditya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662142#comment-15662142
 ] 

Aditya edited comment on SPARK-17836 at 11/13/16 9:39 PM:
--

[-Sean Owen] I want to work on this issue. Is it fine?


was (Author: aditya1702):
Sean Owen I want to work on this issue. Is it fine?

> Use cross validation to determine the number of clusters for EM or KMeans 
> algorithms
> 
>
> Key: SPARK-17836
> URL: https://issues.apache.org/jira/browse/SPARK-17836
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>Priority: Minor
>
> Sometimes it's not easy for users to determine the number of clusters.
> It would be very useful if Spark ML could support this.
> There are several methods to do this according to the wiki:
> https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
> Weka uses cross validation.






[jira] [Commented] (SPARK-17836) Use cross validation to determine the number of clusters for EM or KMeans algorithms

2016-11-13 Thread Aditya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662142#comment-15662142
 ] 

Aditya commented on SPARK-17836:


Sean Owen I want to work on this issue. Is it fine?

> Use cross validation to determine the number of clusters for EM or KMeans 
> algorithms
> 
>
> Key: SPARK-17836
> URL: https://issues.apache.org/jira/browse/SPARK-17836
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Lei Wang
>Priority: Minor
>
> Sometimes it's not easy for users to determine the number of clusters.
> It would be very useful if Spark ML could support this.
> There are several methods to do this according to the wiki:
> https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
> Weka uses cross validation.






[jira] [Updated] (SPARK-18424) Improve Date Parsing Functionality

2016-11-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18424:
--
Assignee: Bill Chambers

> Improve Date Parsing Functionality
> --
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark; it can
> be hard to reason about the time format and what type you're working with. For
> instance,
> say that I have a date in the format:
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by keeping the existing to_date function but adding
> an overload that accepts a format for the date. I also propose a to_timestamp
> function that likewise supports a format,
> so that you can avoid the above conversion entirely.
> It's also worth mentioning that many other databases support this. For
> instance, MySQL has the STR_TO_DATE function, and Netezza supports the
> to_timestamp semantic.
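
To make the shapes concrete, here is today's workaround next to the proposed call shape. The format-accepting to_date overload is the proposal itself, not something assumed to already exist:

{code}
import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

val dateFormat = "yyyy-dd-MM"   // matches the 2017-20-12 (Y-D-M) example above

// Today's workaround: parse to epoch seconds, cast to timestamp, then truncate to a date.
val parsed = to_date(unix_timestamp(col("date"), dateFormat).cast("timestamp")).alias("date")

// Proposed shape (illustrative only): let to_date / to_timestamp take the format directly.
// to_date(col("date"), dateFormat)
{code}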






[jira] [Commented] (SPARK-18356) Issue + Resolution: Kmeans Spark Performances (ML package)

2016-11-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662092#comment-15662092
 ] 

yuhao yang commented on SPARK-18356:


Checking and caching the training data is quite common in MLlib algorithms. 
Some algorithms (LR, ANN) persist the RDD data if the parent DataFrame is not 
cached (using a handlePersistence variable). We can refer to that for the 
implementation.

> Issue + Resolution: Kmeans Spark Performances (ML package)
> --
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1
>Reporter: zakaria hili
>Priority: Minor
>  Labels: easyfix
>
> Hello,
> I'm a newbie in Spark, but I think that I found a small problem that can affect
> Spark KMeans performance.
> Before starting to explain the problem, I want to explain the warning that I
> faced.
> I tried to use Spark Kmeans with Dataframes to cluster my data
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k<=max_cluster) and (wssse > seuilStop):
> kmeans = KMeans().setK(k)
> model = kmeans.fit(df_Part)
> wssse = model.computeCost(df_Part)
> k=k+1
> but when I run the code I receive the warning :
> WARN KMeans: The input data is not directly cached, which may hurt 
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and I
> realized there are two classes responsible for this warning:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala)
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala)
>
> When my DataFrame is cached, the fit method transforms my DataFrame into an
> internal RDD which is not cached.
> DataFrame -> RDD -> run the KMeans training algorithm on the RDD
> -> The first class (ml package) is responsible for converting the DataFrame into
> an RDD and then calling the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here
> Spark verifies whether the RDD is cached; if not, a warning is generated.
> So, the solution to this problem is to cache the RDD before running the KMeans
> algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the DataFrame transformation, then uncache it after
> the training algorithm.
> I hope that I was clear.
> If you think that I was wrong, please let me know.
> Sincerely,
> Zakaria HILI
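
The handlePersistence pattern mentioned in the comment above looks roughly like this. It is a sketch of the idea, not the exact ml.KMeans source; `dataset` is assumed to be the caller's input DataFrame:

{code}
import org.apache.spark.storage.StorageLevel

// Cache the training RDD only if its parent data is not already cached,
// and release it once training is done.
val instances = dataset.rdd                       // `dataset` is assumed in scope
val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
// ... run the KMeans training on `instances` ...
if (handlePersistence) instances.unpersist()
{code}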






[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-13 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15662015#comment-15662015
 ] 

yuhao yang commented on SPARK-18374:


With the default behavior of the _Tokenizer_ and _RegexTokenizer_, I think it's 
more reasonable to directly include words like _won't_, _haven't_ in the stop 
words lists, as shown in the list on http://www.ranks.nl/stopwords.

More specifically, if a user is using the default _Tokenizer_ and 
_RegexTokenizer_ in spark.ml without customization, then _weren_ and _wasn_ in 
the current stop words list are useless, whereas _weren't_ and _wasn't_ can be 
helpful. The default behavior of ML transformers should be consistent and 
effective.

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double checking english.txt for the list of stopwords, as I felt it was
> taking out valid tokens like 'won'. I think the issue is that the english.txt list is
> missing the apostrophe character and all characters after the apostrophe. So "won't"
> became "won" in that list; "wouldn't" is "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both "won't" and "wont" should be
> part of english.txt, as some tokenizers might remove special characters. But
> 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
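
A quick way to check the reported behavior against the shipped list (a sketch; the exact contents of the default list can vary by Spark version):

{code}
import org.apache.spark.ml.feature.StopWordsRemover

// With whitespace tokenization "won't" stays a single token, so a bare "won"
// entry never matches it while a "won't" entry would.
val english = StopWordsRemover.loadDefaultStopWords("english")
println(english.contains("won"))     // reported as present in the shipped list
println(english.contains("won't"))   // reported as missing
{code}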






[jira] [Assigned] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18426:


Assignee: Apache Spark

> Python Documentation Fix for Structured Streaming Programming Guide
> ---
>
> Key: SPARK-18426
> URL: https://issues.apache.org/jira/browse/SPARK-18426
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Denny Lee
>Assignee: Apache Spark
>Priority: Minor
>  Labels: documentation
> Fix For: 2.0.2
>
>
> When running the Python example in the Structured Streaming Guide, you get the error:
> spark = SparkSession\
> TypeError: 'Builder' object is not callable
> This is fixed by changing .builder() to .builder.






[jira] [Commented] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661961#comment-15661961
 ] 

Apache Spark commented on SPARK-18426:
--

User 'dennyglee' has created a pull request for this issue:
https://github.com/apache/spark/pull/15872

> Python Documentation Fix for Structured Streaming Programming Guide
> ---
>
> Key: SPARK-18426
> URL: https://issues.apache.org/jira/browse/SPARK-18426
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Denny Lee
>Priority: Minor
>  Labels: documentation
> Fix For: 2.0.2
>
>
> When running the Python example in the Structured Streaming Guide, you get the error:
> spark = SparkSession\
> TypeError: 'Builder' object is not callable
> This is fixed by changing .builder() to .builder.






[jira] [Assigned] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18426:


Assignee: (was: Apache Spark)

> Python Documentation Fix for Structured Streaming Programming Guide
> ---
>
> Key: SPARK-18426
> URL: https://issues.apache.org/jira/browse/SPARK-18426
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Denny Lee
>Priority: Minor
>  Labels: documentation
> Fix For: 2.0.2
>
>
> When running the Python example in the Structured Streaming Guide, you get the error:
> spark = SparkSession\
> TypeError: 'Builder' object is not callable
> This is fixed by changing .builder() to .builder.






[jira] [Created] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Denny Lee (JIRA)
Denny Lee created SPARK-18426:
-

 Summary: Python Documentation Fix for Structured Streaming 
Programming Guide
 Key: SPARK-18426
 URL: https://issues.apache.org/jira/browse/SPARK-18426
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Denny Lee
Priority: Minor
 Fix For: 2.0.2


When running the Python example in the Structured Streaming Guide, I get the error:
spark = SparkSession\
TypeError: 'Builder' object is not callable

This is fixed by changing .builder() to .builder 
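For reference, the corrected snippet reads as follows (a minimal sketch; the application name is illustrative):

{code}
from pyspark.sql import SparkSession

# .builder is an attribute, not a method, so it must not be called with ().
spark = SparkSession.builder \
    .appName("StructuredStreamingExample") \
    .getOrCreate()
{code}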



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame

2016-11-13 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661918#comment-15661918
 ] 

koert kuipers commented on SPARK-15798:
---

It turns out the operations needed for this are already mostly available in 
Dataset. The one big limitation is that the secondary sort does not seem to get 
pushed into the shuffle in Spark SQL (although it is done efficiently, with 
spilling to disk, etc.). See this conversation:
https://www.mail-archive.com/user@spark.apache.org/msg58844.html

I added support for Dataset secondary sort to spark-sorted; see here:
https://github.com/tresata/spark-sorted

I would also like to add support for DataFrame, but to do so I would need 
operations to convert Row to UDF inputs and back. These are available in Spark 
SQL (Encoder, ScalaReflection, etc.), but they support InternalRow only, while 
in a 3rd-party library I need to work with normal Rows, since InternalRows are 
never exposed (for example, in Dataset[Row].mapPartitions I have Rows, not 
InternalRows).
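For illustration, the DataFrame-level pattern under discussion (repartition by the grouping key, then sort within each partition) can be sketched in PySpark as below. This is only a sketch of the idea, not the spark-sorted implementation, and the sort here happens after the shuffle rather than being pushed into it, which is the limitation described above:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 3), ("a", 1), ("b", 2), ("a", 2)], ["k", "v"])

# Repartition by the grouping key, then sort inside each partition so that
# every key's rows arrive grouped together and ordered by the secondary column.
grouped_sorted = df.repartition("k").sortWithinPartitions("k", "v")
grouped_sorted.show()
{code}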

> Secondary sort in Dataset/DataFrame
> ---
>
> Key: SPARK-15798
> URL: https://issues.apache.org/jira/browse/SPARK-15798
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: koert kuipers
>
> Secondary sort for Spark RDDs was discussed in 
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library this 
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However it seems to me that with Dataset an implementation in a 3rd party 
> library of such a feature is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as 
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): 
> Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you would only 
> want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and 
> RelationalGroupedDataset?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should 
> :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17116:


Assignee: Apache Spark

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxIter": 20})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-11-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17116:


Assignee: (was: Apache Spark)

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxIter": 20})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17116) Allow params to be a {string, value} dict at fit time

2016-11-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661887#comment-15661887
 ] 

Apache Spark commented on SPARK-17116:
--

User 'aditya1702' has created a pull request for this issue:
https://github.com/apache/spark/pull/15871

> Allow params to be a {string, value} dict at fit time
> -
>
> Key: SPARK-17116
> URL: https://issues.apache.org/jira/browse/SPARK-17116
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Manoj Kumar
>Priority: Minor
>
> Currently, it is possible to override the default params set at constructor 
> time by supplying a ParamMap which is essentially a (Param: value) dict.
> Looking at the codebase, it should be trivial to extend this to a (string, 
> value) representation.
> {code}
> # This hints that the maxiter param of the lr instance is modified in-place
> lr = LogisticRegression(maxIter=10, regParam=0.01)
> lr.fit(dataset, {lr.maxIter: 20})
> # This seems more natural.
> lr.fit(dataset, {"maxIter": 20})
> {code}
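One possible shape of the change, sketched as a hypothetical user-side helper (not the actual pull request) that relies only on the existing Params.getParam lookup:

{code}
from pyspark.ml.classification import LogisticRegression

def resolve_params(stage, params):
    """Map {"name": value} entries onto the stage's Param objects."""
    resolved = {}
    for key, value in params.items():
        # Accept either a Param instance or its string name.
        resolved[stage.getParam(key) if isinstance(key, str) else key] = value
    return resolved

lr = LogisticRegression(maxIter=10, regParam=0.01)
# model = lr.fit(dataset, resolve_params(lr, {"maxIter": 20}))
{code}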



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18251) DataSet API | RuntimeException: Null value appeared in non-nullable field when holding Option Case Class

2016-11-13 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658492#comment-15658492
 ] 

Aniket Bhatnagar edited comment on SPARK-18251 at 11/13/16 5:17 PM:


Hi [~jayadevan.m]

Which versions of Scala and Spark did you use? I can reproduce this on Spark 
2.0.1 and Scala 2.11.8. I have created a sample project with all the 
dependencies to reproduce it easily:
https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug
To reproduce the bug, simply check out the project and run the command sbt run.

Thanks,
Aniket


was (Author: aniket):
Hi [~jayadevan.m]

Which version of scala spark did you use? I can reproduce this on spark 2.0.1 
and scala 2.11.8. I have created a sample project with all the dependencies to 
easily reproduce this:
https://github.com/aniketbhatnagar/SPARK-18251-data-set-option-bug
To reproduce the bug, simple checkout the project and run the command sbt run.

Thanks,
Aniket

> DataSet API | RuntimeException: Null value appeared in non-nullable field 
> when holding Option Case Class
> 
>
> Key: SPARK-18251
> URL: https://issues.apache.org/jira/browse/SPARK-18251
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: OS X
>Reporter: Aniket Bhatnagar
>
> I am running into a runtime exception when a DataSet is holding an empty 
> (None) instance of an Option type whose value class has a non-nullable field. For 
> instance, if we have the following case class:
> case class DataRow(id: Int, value: String)
> Then DataSet[Option[DataRow]] can only hold Some(DataRow) objects and cannot 
> hold None. If it does, the following exception is thrown:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 6 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 6.0 in stage 0.0 (TID 6, localhost): java.lang.RuntimeException: 
> Null value appeared in non-nullable field:
> - field (class: "scala.Int", name: "id")
> - option value class: "DataSetOptBug.DataRow"
> - root class: "scala.Option"
> If the schema is inferred from a Scala tuple/case class, or a Java bean, 
> please try to use scala.Option[_] or other nullable types (e.g. 
> java.lang.Integer instead of int/scala.Int).
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The bug can be reproduce by using the program: 
> https://gist.github.com/aniketbhatnagar/2ed74613f70d2defe999c18afaa4816e



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18413) Add a property to control the number of partitions when save a jdbc rdd

2016-11-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661769#comment-15661769
 ] 

Dongjoon Hyun commented on SPARK-18413:
---

Hi, [~lichenglingl].
Although the code change is simple, I could not find a proper way to unit-test 
it, so it took a long time.
Could you try the PR with your use case?

> Add a property to control the number of partitions when save a jdbc rdd
> ---
>
> Key: SPARK-18413
> URL: https://issues.apache.org/jira/browse/SPARK-18413
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> {code}
> CREATE or replace TEMPORARY VIEW resultview
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   url "jdbc:oracle:thin:@10.129.10.111:1521:BKDB",
>   dbtable "result",
>   user "HIVE",
>   password "HIVE"
> );
> --set spark.sql.shuffle.partitions=200
> insert overwrite table resultview select g,count(1) as count from 
> tnet.DT_LIVE_INFO group by g
> {code}
> I'm trying to save a Spark SQL result to Oracle.
> I found that Spark creates a JDBC connection for each partition.
> If the SQL creates too many partitions, the database can't hold that many 
> connections and returns an exception.
> In the situation above it is 200, because of the "group by" and 
> "spark.sql.shuffle.partitions".
> The Spark source code in JdbcUtil is:
> {code}
> def saveTable(
>   df: DataFrame,
>   url: String,
>   table: String,
>   properties: Properties) {
> val dialect = JdbcDialects.get(url)
> val nullTypes: Array[Int] = df.schema.fields.map { field =>
>   getJdbcType(field.dataType, dialect).jdbcNullType
> }
> val rddSchema = df.schema
> val getConnection: () => Connection = createConnectionFactory(url, 
> properties)
> val batchSize = properties.getProperty(JDBC_BATCH_INSERT_SIZE, 
> "1000").toInt
> df.foreachPartition { iterator =>
>   savePartition(getConnection, table, iterator, rddSchema, nullTypes, 
> batchSize, dialect)
> }
>   }
> {code}
> Maybe we can add a property so that this becomes 
> df.repartition(num).foreachPartition?
> In fact, I got the exception "ORA-12519, TNS:no appropriate service handler 
> found".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18421) Dynamic disk allocation

2016-11-13 Thread Aniket Bhatnagar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661760#comment-15661760
 ] 

Aniket Bhatnagar commented on SPARK-18421:
--

I agree that Spark doesn't manage storage, and therefore running an agent and 
dynamically adding storage to a host is out of scope. However, what is in scope 
for Spark is the ability to use added storage without forcing a restart of the 
executor process. Specifically, spark.local.dirs needs to be a dynamic 
property. For example, spark.local.dirs could be configured as a glob pattern 
(something like /mnt*), and whenever a new disk is added and mounted (as 
/mnt), Spark's shuffle service should be able to use the locally 
added disk. Additionally, there may be a task to rebalance shuffle blocks once a 
disk is added, so that all local dirs are once again used equally.

I don't think detection of a newly mounted directory, rebalancing of blocks, 
etc. is cloud-specific, as all of this can be done using Java's IO/NIO APIs.

This feature would, however, be mostly useful for users running Spark in the 
cloud. Currently, users are expected to guess their shuffle storage 
footprint and mount appropriately sized disks up front; if the guess is wrong, 
the job fails, wasting a lot of time.
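As an illustration of the proposal (not existing Spark behaviour), resolving the local-dirs setting as a glob pattern could look roughly like this; the /mnt* pattern stands in for a hypothetical spark.local.dirs value:

{code}
import glob

LOCAL_DIRS_PATTERN = "/mnt*"  # hypothetical spark.local.dirs value

def resolve_local_dirs(pattern=LOCAL_DIRS_PATTERN):
    # Re-resolving the pattern periodically would pick up any freshly
    # mounted /mntN directory without restarting the executor process.
    return sorted(glob.glob(pattern))

print(resolve_local_dirs())
{code}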

> Dynamic disk allocation
> ---
>
> Key: SPARK-18421
> URL: https://issues.apache.org/jira/browse/SPARK-18421
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> Dynamic allocation feature allows you to add executors and scale computation 
> power. This is great, however, I feel like we also need a way to dynamically 
> scale storage. Currently, if the disk is not able to hold the spilled/shuffle 
> data, the job is aborted (in yarn, the node manager kills the container) 
> causing frustration and loss of time. In deployments like AWS EMR, it is 
> possible to run an agent that add disks on the fly if it sees that the disks 
> are running out of space and it would be great if Spark could immediately 
> start using the added disks just as it does when new executors are added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-13 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661640#comment-15661640
 ] 

Ran Haim edited comment on SPARK-17436 at 11/13/16 3:37 PM:


Hi,
I only got a chance to work on this now.
I saw that the whole class tree got changed, so I changed the code in 
org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot seem to run mvn clean install: a lot of tests fail 
(not related to my change; they fail without it too), and I do want to make sure 
there are relevant tests (though I did not find any).

Any ideas?

Also, I cannot create a pull request; I get a 403.

Ran


was (Author: ran.h...@optimalplus.com):
Hi,
I only got a chance to work on it now.
I saw that the whole class tree got changed - I changed the code in 
org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is I cannot seem to run a mvn clean install...A lot of tests fail 
(not relevant to my change, and happen without it) - And I do want to make sure 
there are relevant tests (though I did not find any).

Any Ideas?

Ran,

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter, and instead put the 
> rows in a map with the partition key as the key and an array list as the value, 
> then walk through all the keys and write the rows in their original order; this 
> will probably be faster, as there is no need for ordering.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17436) dataframe.write sometimes does not keep sorting

2016-11-13 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661640#comment-15661640
 ] 

Ran Haim commented on SPARK-17436:
--

Hi,
I only got a chance to work on this now.
I saw that the whole class tree got changed, so I changed the code in 
org.apache.spark.sql.execution.datasources.FileFormatWriter.
The problem is that I cannot seem to run mvn clean install: a lot of tests fail 
(not related to my change; they fail without it too), and I do want to make sure 
there are relevant tests (though I did not find any).

Any ideas?

Ran

> dataframe.write sometimes does not keep sorting
> ---
>
> Key: SPARK-17436
> URL: https://issues.apache.org/jira/browse/SPARK-17436
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Ran Haim
>
> When using partition by, the data writer can sometimes mess up an ordered 
> dataframe.
> The problem originates in 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.
> In the writeRows method, when too many files are opened (configurable), it 
> starts inserting rows into UnsafeKVExternalSorter, then reads all the rows 
> back from the sorter and writes them to the corresponding files.
> The problem is that the sorter actually sorts the rows using the partition 
> key, and that can sometimes mess up the original sort (or secondary sort, if 
> you will).
> I think the best way to fix it is to stop using a sorter, and instead put the 
> rows in a map with the partition key as the key and an array list as the value, 
> then walk through all the keys and write the rows in their original order; this 
> will probably be faster, as there is no need for ordering.
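As a conceptual illustration of the proposed fix (the real code path is Scala inside the file writer), buffering rows per partition key in insertion order, instead of re-sorting them, preserves any pre-existing row order within each key:

{code}
from collections import defaultdict

def bucket_rows(rows, partition_key):
    # Buffer rows per partition key; each bucket keeps the original row order.
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition_key(row)].append(row)
    return buckets

rows = [("p1", 1), ("p2", 1), ("p1", 2), ("p2", 2)]
for key, bucket in bucket_rows(rows, lambda r: r[0]).items():
    print(key, bucket)  # rows within each key stay in their original order
{code}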



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-13 Thread coneyliu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

coneyliu updated SPARK-18420:
-
Component/s: Build

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>
> Small fix: fix the compile errors reported by checkstyle.
> Before:
> ```
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] 
> (imports) UnusedImports: Unused import - 
> org.apache.commons.crypto.cipher.CryptoCipherFactory.
> [ERROR] 
> src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] 
> (modifier) RedundantModifier: Redundant 'public' modifier.
> [ERROR] 
> src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71]
>  (sizes) LineLength: Line is longer than 100 characters (found 113).
> [ERROR] 
> src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112]
>  (sizes) LineLength: Line is longer than 100 characters (found 110).
> src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64]
>  (sizes) LineLength: Line is longer than 100 characters (found 103).
> [ERROR] 
> src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] 
> (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
> [ERROR] 
> src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] 
> (regexp) RegexpSingleline: No trailing whitespace allowed.
> ```
> After:
> `mvn install`
> `lint-java`
> Checkstyle checks passed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18420) Fix the compile errors caused by checkstyle

2016-11-13 Thread coneyliu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

coneyliu updated SPARK-18420:
-
Description: 
Small fix: fix the compile errors reported by checkstyle.

Before:
```
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] 
(imports) UnusedImports: Unused import - 
org.apache.commons.crypto.cipher.CryptoCipherFactory.
[ERROR] src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] 
(modifier) RedundantModifier: Redundant 'public' modifier.
[ERROR] 
src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71] 
(sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] 
src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112]
 (sizes) LineLength: Line is longer than 100 characters (found 110).
src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64]
 (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] 
src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] 
(imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
[ERROR] 
src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] 
(regexp) RegexpSingleline: No trailing whitespace allowed.
```

After:
`mvn install`
`lint-java`
Checkstyle checks passed

> Fix the compile errors caused by checkstyle
> ---
>
> Key: SPARK-18420
> URL: https://issues.apache.org/jira/browse/SPARK-18420
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: coneyliu
>Priority: Minor
>
> Small fix: fix the compile errors reported by checkstyle.
> Before:
> ```
> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[21,8] 
> (imports) UnusedImports: Unused import - 
> org.apache.commons.crypto.cipher.CryptoCipherFactory.
> [ERROR] 
> src/test/java/org/apache/spark/network/sasl/SparkSaslSuite.java:[516,5] 
> (modifier) RedundantModifier: Redundant 'public' modifier.
> [ERROR] 
> src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeMapData.java:[71]
>  (sizes) LineLength: Line is longer than 100 characters (found 113).
> [ERROR] 
> src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeArrayData.java:[112]
>  (sizes) LineLength: Line is longer than 100 characters (found 110).
> src/main/java/org/apache/spark/examples/ml/JavaLogisticRegressionWithElasticNetExample.java:[64]
>  (sizes) LineLength: Line is longer than 100 characters (found 103).
> [ERROR] 
> src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[22,8] 
> (imports) UnusedImports: Unused import - org.apache.spark.ml.linalg.Vectors.
> [ERROR] 
> src/main/java/org/apache/spark/examples/ml/JavaInteractionExample.java:[51] 
> (regexp) RegexpSingleline: No trailing whitespace allowed.
> ```
> After:
> `mvn install`
> `lint-java`
> Checkstyle checks passed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18421) Dynamic disk allocation

2016-11-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15661520#comment-15661520
 ] 

Sean Owen commented on SPARK-18421:
---

Spark doesn't manage storage at all, so I don't think this could be in scope, 
especially because it could only apply to cloud deployments and would be 
cloud-specific.

> Dynamic disk allocation
> ---
>
> Key: SPARK-18421
> URL: https://issues.apache.org/jira/browse/SPARK-18421
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> Dynamic allocation feature allows you to add executors and scale computation 
> power. This is great, however, I feel like we also need a way to dynamically 
> scale storage. Currently, if the disk is not able to hold the spilled/shuffle 
> data, the job is aborted (in yarn, the node manager kills the container) 
> causing frustration and loss of time. In deployments like AWS EMR, it is 
> possible to run an agent that add disks on the fly if it sees that the disks 
> are running out of space and it would be great if Spark could immediately 
> start using the added disks just as it does when new executors are added.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-18363) Connected component for large graph result is wrong

2016-11-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-18363:
---

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem 
> to work correctly on a large graph.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18363) Connected component for large graph result is wrong

2016-11-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18363.
---
Resolution: Not A Problem

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem 
> to work correctly on a large graph.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org