[jira] [Commented] (SPARK-10263) Add @Since annotation to ml.param and ml.*

2015-09-29 Thread Hiroshi Takahashi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934717#comment-14934717
 ] 

Hiroshi Takahashi commented on SPARK-10263:
---

I'll work on this issue.
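For reference, a minimal sketch of what applying the annotation looks like (the class, members, and version strings below are only illustrative, not taken from any actual patch):

{code}
import org.apache.spark.annotation.Since

// Illustrative only: record the release in which each public member first appeared.
@Since("1.2.0")
class ExampleParams {

  @Since("1.2.0")
  val name: String = "example"

  @Since("1.5.0")
  def explain(): String = s"param: $name"
}
{code}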

> Add @Since annotation to ml.param and ml.*
> --
>
> Key: SPARK-10263
> URL: https://issues.apache.org/jira/browse/SPARK-10263
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9776) Another instance of Derby may have already booted the database

2015-09-29 Thread KaiXinXIaoLei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934588#comment-14934588
 ] 

KaiXinXIaoLei edited comment on SPARK-9776 at 9/29/15 6:14 AM:
---

In a secure cluster, using yarn-client mode, this problem still exists. I just 
run "bin/spark-shell --master yarn-client".


was (Author: kaixinxiaolei):
In security cluster, and using yarn-client mode, this problem still exist.

> Another instance of Derby may have already booted the database 
> ---
>
> Key: SPARK-9776
> URL: https://issues.apache.org/jira/browse/SPARK-9776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Mac Yosemite, spark-1.5.0
>Reporter: Sudhakar Thota
> Attachments: SPARK-9776-FL1.rtf
>
>
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) results in an 
> error, though the same works in spark-1.4.1.
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934724#comment-14934724
 ] 

Saisai Shao commented on SPARK-10858:
-

[~tgraves], I assume you're using yarn-cluster mode to submit the application, 
because yarn-client handles {{--jars}} differently and the stack trace would be 
different.

The interesting thing is that I get the opposite result from yours: I succeed 
without the scheme added but fail with the scheme added. Here are my two 
commands:

success:

{code}
./bin/spark-submit --master yarn-cluster --queue a --jars 
/Users/sshao/projects/apache-spark/my.jar\#renamed.jar  --class 
org.apache.spark.examples.SparkPi 
examples/target/scala-2.10/spark-examples-1.6.0-SNAPSHOT-hadoop2.6.0.jar 10
{code}

failed:

{code}
./bin/spark-submit --master yarn-cluster --queue a --jars 
file:///Users/sshao/projects/apache-spark/my.jar#renamed.jar  --class 
org.apache.spark.examples.SparkPi 
examples/target/scala-2.10/spark-examples-1.6.0-SNAPSHOT-hadoop2.6.0.jar 10
{code}




> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10851) Exception not failing R applications (in yarn cluster mode)

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935178#comment-14935178
 ] 

Apache Spark commented on SPARK-10851:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/8938

> Exception not failing R applications (in yarn cluster mode)
> ---
>
> Key: SPARK-10851
> URL: https://issues.apache.org/jira/browse/SPARK-10851
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Zsolt Tóth
>
> The bug is the R version of SPARK-7736. The R script fails with an exception 
> but the Yarn application status is SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10851) Exception not failing R applications (in yarn cluster mode)

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10851:


Assignee: Apache Spark

> Exception not failing R applications (in yarn cluster mode)
> ---
>
> Key: SPARK-10851
> URL: https://issues.apache.org/jira/browse/SPARK-10851
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Zsolt Tóth
>Assignee: Apache Spark
>
> The bug is the R version of SPARK-7736. The R script fails with an exception 
> but the Yarn application status is SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10851) Exception not failing R applications (in yarn cluster mode)

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10851:


Assignee: (was: Apache Spark)

> Exception not failing R applications (in yarn cluster mode)
> ---
>
> Key: SPARK-10851
> URL: https://issues.apache.org/jira/browse/SPARK-10851
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Zsolt Tóth
>
> The bug is the R version of SPARK-7736. The R script fails with an exception 
> but the Yarn application status is SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10851) Exception not failing R applications (in yarn cluster mode)

2015-09-29 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935181#comment-14935181
 ] 

Sun Rui commented on SPARK-10851:
-

[~ztoth] does this PR fix your problem?

> Exception not failing R applications (in yarn cluster mode)
> ---
>
> Key: SPARK-10851
> URL: https://issues.apache.org/jira/browse/SPARK-10851
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Zsolt Tóth
>
> The bug is the R version of SPARK-7736. The R script fails with an exception 
> but the Yarn application status is SUCCEEDED.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10268) Add @Since annotation to ml.tree

2015-09-29 Thread Hiroshi Takahashi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934706#comment-14934706
 ] 

Hiroshi Takahashi commented on SPARK-10268:
---

I think there are no public classes or methods in ml.tree that are not marked 
DeveloperApi. (But the parent issue says we should add annotations to stable and 
experimental methods.)
Should I add annotations to private, package-private, or DeveloperApi members?

> Add @Since annotation to ml.tree
> 
>
> Key: SPARK-10268
> URL: https://issues.apache.org/jira/browse/SPARK-10268
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast

2015-09-29 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934731#comment-14934731
 ] 

Dilip Biswal commented on SPARK-8654:
-

I would like to work on this issue.

> Analysis exception when using "NULL IN (...)": invalid cast
> ---
>
> Key: SPARK-8654
> URL: https://issues.apache.org/jira/browse/SPARK-8654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Santiago M. Mola
>Priority: Minor
>
> The following query throws an analysis exception:
> {code}
> SELECT * FROM t WHERE NULL NOT IN (1, 2, 3);
> {code}
> The exception is:
> {code}
> org.apache.spark.sql.AnalysisException: invalid cast from int to null;
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> {code}
> Here is a test that can be added to AnalysisSuite to check the issue:
> {code}
>   test("SPARK- regression test") {
> val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), 
> "a")() :: Nil,
>   LocalRelation()
> )
> caseInsensitiveAnalyze(plan)
>   }
> {code}
> Note that this kind of query is a corner case, but it is still valid SQL. An 
> expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL 
> as a result, even if the list contains NULL. So it is safe to translate these 
> expressions to Literal(null) during analysis.
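A rough sketch of what that translation could look like as a Catalyst rule (this only illustrates the idea in the last paragraph; the rule name, placement, and pattern are assumptions, not the actual fix):

{code}
import org.apache.spark.sql.catalyst.expressions.{In, Literal, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

// Sketch only: rewrite NULL IN (...) and NULL NOT IN (...) to a null boolean
// literal, since the result is always NULL regardless of the list contents.
object NullInListToLiteral extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Not(In(Literal(null, _), _)) => Literal.create(null, BooleanType)
    case In(Literal(null, _), _)      => Literal.create(null, BooleanType)
  }
}
{code}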



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10263) Add @Since annotation to ml.param and ml.*

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10263:


Assignee: (was: Apache Spark)

> Add @Since annotation to ml.param and ml.*
> --
>
> Key: SPARK-10263
> URL: https://issues.apache.org/jira/browse/SPARK-10263
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10263) Add @Since annotation to ml.param and ml.*

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934751#comment-14934751
 ] 

Apache Spark commented on SPARK-10263:
--

User 'taishi-oss' has created a pull request for this issue:
https://github.com/apache/spark/pull/8935

> Add @Since annotation to ml.param and ml.*
> --
>
> Key: SPARK-10263
> URL: https://issues.apache.org/jira/browse/SPARK-10263
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10867) df.write.partitionBy With Two Columns Collapses first Column

2015-09-29 Thread Bryan Rivera (JIRA)
Bryan Rivera created SPARK-10867:


 Summary: df.write.partitionBy With Two Columns Collapses first 
Column
 Key: SPARK-10867
 URL: https://issues.apache.org/jira/browse/SPARK-10867
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Bryan Rivera


With the following Spark Streaming code, the directory structure should be:

```
/base
/long_column=1
 /string_column=a
 /string_column=b
/long_column=2
 /string_column=a
 /string_column=b
```

But instead it is:

```
/base
/long_column=1
 /string_column=a
 /string_column=b
```

The long_column=2 files are being written under long_column=1 by appending to 
its child directories.


```
import org.apache.spark.sql.{SQLContext, SaveMode}

dStream.foreachRDD { rdd =>
  implicit val sqlContext = SQLContext.getOrCreate(rdd.context)
  import sqlContext.implicits._

  // rdd elements are assumed to be case classes exposing long_column and string_column
  rdd.toDF.write.partitionBy("long_column", "string_column")
    .mode(SaveMode.Append)
    .parquet(filePath)
}
```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935129#comment-14935129
 ] 

Thomas Graves commented on SPARK-10858:
---

Yes, it's a bad thing, as users don't know when # works. It should work in all 
cases, file://, hdfs://, etc. The default is file://, so I would expect it to act 
the same whether you specify the scheme or not, since that is the default.

[~jerryshao], what was the error you got in the failed case? You escaped the # 
in the first case and not the second. What platform are you on?

I was assuming it was failing when the scheme wasn't explicit because we are using 
getFragment(), or perhaps it wasn't fully parsing the URI without the scheme.
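For what it's worth, here is a quick standalone check of what {{java.net.URI}} makes of the two forms (just a REPL snippet, not Spark code):

{code}
import java.net.URI

// Standalone check: print scheme, path and fragment for both forms of --jars.
Seq("file:///home/foo/my.jar#renamed.jar", "/home/foo/my.jar#renamed.jar").foreach { s =>
  val uri = new URI(s)
  println(s"$s -> scheme=${uri.getScheme}, path=${uri.getPath}, fragment=${uri.getFragment}")
}
{code}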

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935129#comment-14935129
 ] 

Thomas Graves edited comment on SPARK-10858 at 9/29/15 1:06 PM:


Yes, it's a bad thing, as users don't know when # works. It should work in all 
cases, file://, hdfs://, etc. The default is file://, so I would expect it to act 
the same whether you specify the scheme or not, since that is the default.

[~jerryshao], what was the error you got in the failed case? You escaped the # 
in the first case and not the second. What platform are you on?

I was assuming it was failing when the scheme wasn't explicit because we are using 
getFragment(), or perhaps it wasn't fully parsing the URI without the scheme.


was (Author: tgraves):
yes its a bad thing as users don't know when # works.  It should work in all 
cases, file://, hdfs://. The default is file:// so I would expect it to act the 
same whether you specify the scheme or not since that is the default.

[~jerryshao]  what was the error you got in the failed case?  You escaped the # 
in the first case and now the second. what platform are you on?

I was assuming it was failing when the scheme was explicit because we are using 
getFragment() for perhaps it wasn't fully parsing the URI without the scheme.

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935134#comment-14935134
 ] 

Thomas Graves commented on SPARK-10858:
---

Note the # part is the name that we give it on the YARN side and that the 
executor actually sees, so it doesn't matter where the file originates from 
(file:// or hdfs://).

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)
Julien Genini created SPARK-10869:
-

 Summary: Auto-normalization of semi-structured schema from a 
dataframe
 Key: SPARK-10869
 URL: https://issues.apache.org/jira/browse/SPARK-10869
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 1.5.1
Reporter: Julien Genini
Priority: Minor


Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
This is not easy to work with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (linear, default False)
that returns the path for each field and the list of the different node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}
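The request is for PySpark, but the traversal behind such a 'pathName' listing is straightforward; here is a hedged Scala sketch over a StructType (function name and output shape are illustrative only, not a proposed API):

{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch only: list (pathName, typeName) for every leaf field of a schema,
// analogous to the 'pathName' entries shown above.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[(String, String)] =
  schema.fields.toSeq.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      case st: StructType => flattenSchema(st, path)
      case dt: DataType   => Seq(path -> dt.typeName)
    }
  }

// e.g. flattenSchema(sqlContext.read.json(jsonPath).schema)
{code}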






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
This is not easy to work with in data warehousing, where it's better to normalize the data.

I propose adding an option when you get the schema (normalized, default False).
The returned json schema will then contain the normalized path for each field, 
and the list of the different node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}




  was:
today, you can get a multi-depth schema from a semi-structured dataframe. (XML, 
JSON, etc..)
Not so easy to deal in data warehousing where it's better to normalize the data.

I propose an option to add when you get the schema (linear, default False)
with the path for each field, and the list of the different node levels

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(linear=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}





> Auto-normalization of semi-structured schema from a dataframe
> -
>
> Key: SPARK-10869
> URL: https://issues.apache.org/jira/browse/SPARK-10869
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Julien Genini
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Today, you can get a multi-depth schema from a semi-structured dataframe 
> (XML, JSON, etc.).
> This is not easy to work with in data warehousing, where it's better to 
> normalize the data.
> I propose adding an option when you get the schema (normalized, default False).
> The returned json schema will then contain the normalized path for each 
> field, and the list of the different node levels:
> df = sqlContext.read.json(jsonPath)
> jsonLinearSchema = df.schema.jsonValue(normalized=True)
> >>
> {'fields': [{'metadata': {},  
>   
>  'name': 'BusinessDate',
>  'nullable': True,
>  'pathName': 

[jira] [Comment Edited] (SPARK-7135) Expression for monotonically increasing IDs

2015-09-29 Thread Martin Senne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934947#comment-14934947
 ] 

Martin Senne edited comment on SPARK-7135 at 9/29/15 10:04 AM:
---

@[~rxin]: First, great feature. Unfortunately, there seems to be no way to set 
an offset at which to start indexing (currently indexing always starts at 0). 
Is there any chance to add this?

As this is my first post here, I'm not sure if this is the right place. Should 
I raise a feature request here in Jira?


was (Author: martinsenne):
@[~rxin]: First, great feature. Unfortunately, there seems to be no way to set 
an offset, where to start indexing at (currently indexing starts always at 0). 
Is there any chance, to add this?

As this is my first post here, I'm not sure if this place here. Should I raise 
a feature request here in Jira?

> Expression for monotonically increasing IDs
> ---
>
> Key: SPARK-7135
> URL: https://issues.apache.org/jira/browse/SPARK-7135
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: dataframe
> Fix For: 1.4.0
>
>
> Seems like a common use case that users might want a unique ID for each row. 
> It is more expensive to have consecutive IDs, since that'd require two pass 
> over the data. However, many use cases can be satisfied by just having unique 
> ids.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10868) monotonicallyIncreasingId() supports offset for indexing

2015-09-29 Thread Martin Senne (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Senne updated SPARK-10868:
-
Summary: monotonicallyIncreasingId() supports offset for indexing  (was: 
monotonicallyIncreasingId() supports offset to start indexing at)

> monotonicallyIncreasingId() supports offset for indexing
> 
>
> Key: SPARK-10868
> URL: https://issues.apache.org/jira/browse/SPARK-10868
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Martin Senne
>
> With SPARK-7135 and https://github.com/apache/spark/pull/5709 
> `monotonicallyIncreasingID()` allows creating an index column with unique 
> ids. The indexing always starts at 0 (no offset).
> Feature wish: a parameter `offset`, such that the function can be used 
> as
> monotonicallyIncreasingID( offset )
> and indexing *starts at `offset` instead of 0*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-29 Thread Chris Heller (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935043#comment-14935043
 ] 

Chris Heller commented on SPARK-8734:
-

I agree on the goal. Without a total redesign of the Spark properties format, 
this approach will get us most of the way there. 

I noticed in your patch that you also unified the SparkConf/MesosProperties 
logic, which makes maintaining these special cases simpler. Good documentation 
should provide the needed level of user-friendliness.

+1

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3862) MultiWayBroadcastInnerHashJoin

2015-09-29 Thread David Sabater (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935054#comment-14935054
 ] 

David Sabater commented on SPARK-3862:
--

Thanks Reynold - I am really interested in this feature and happy to contribute 
in whatever form. I see there is actually a Jira task open for that:
https://issues.apache.org/jira/browse/SPARK-3863

Please let me know how I can contribute. I am attending Spark Summit EU, so we 
may see each other there to talk about potential use cases and ways to 
collaborate.


Regards. 

> MultiWayBroadcastInnerHashJoin
> --
>
> Key: SPARK-3862
> URL: https://issues.apache.org/jira/browse/SPARK-3862
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> It is common to have a single fact table inner join many small dimension 
> tables.  We can exploit this fact and create a MultiWayBroadcastInnerHashJoin 
> (or maybe just MultiwayDimensionJoin) operator that optimizes for this 
> pattern.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10868) monotonicallyIncreasingId() supports offset for indexing

2015-09-29 Thread Martin Senne (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Senne updated SPARK-10868:
-
Description: 
With SPARK-7135 and https://github.com/apache/spark/pull/5709 
`monotonicallyIncreasingID()` allows creating an index column with unique ids. 
The indexing always starts at 0 (no offset).

Feature wish: a parameter `offset`, such that the function can be used as

{{monotonicallyIncreasingID( offset )}}

and indexing *starts at `offset` instead of 0*.



  was:
With SPARK-7135 and https://github.com/apache/spark/pull/5709 
`monotonicallyIncreasingID()` allows to create an index column with unique ids. 
The indexing always starts at 0 (no offset).

Feature wish: Having a parameter `offset`, such that the function can be used as

monotonicallyIncreasingID( offset )

and indexing *starts at `offset` instead of 0*.




> monotonicallyIncreasingId() supports offset for indexing
> 
>
> Key: SPARK-10868
> URL: https://issues.apache.org/jira/browse/SPARK-10868
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Martin Senne
>
> With SPARK-7135 and https://github.com/apache/spark/pull/5709 
> `monotonicallyIncreasingID()` allows creating an index column with unique 
> ids. The indexing always starts at 0 (no offset).
> Feature wish: a parameter `offset`, such that the function can be used 
> as
> {{monotonicallyIncreasingID( offset )}}
> and indexing *starts at `offset` instead of 0*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7135) Expression for monotonically increasing IDs

2015-09-29 Thread Martin Senne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934947#comment-14934947
 ] 

Martin Senne commented on SPARK-7135:
-

@[~rxin]: First, great feature. Unfortunately, there seems to be no way to set 
an offset at which to start indexing (currently indexing always starts at 0). 
Is there any chance to add this?

As this is my first post here, I'm not sure if this is the right place. Should 
I raise a feature request here in Jira?

> Expression for monotonically increasing IDs
> ---
>
> Key: SPARK-7135
> URL: https://issues.apache.org/jira/browse/SPARK-7135
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: dataframe
> Fix For: 1.4.0
>
>
> Seems like a common use case that users might want a unique ID for each row. 
> It is more expensive to have consecutive IDs, since that'd require two pass 
> over the data. However, many use cases can be satisfied by just having unique 
> ids.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10868) monotonicallyIncreasingId() supports offset to start indexing at

2015-09-29 Thread Martin Senne (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Senne updated SPARK-10868:
-
Description: 
With SPARK-7135 and https://github.com/apache/spark/pull/5709 
`monotonicallyIncreasingID()` allows to create an index column with unique ids. 
The indexing always starts at 0 (no offset).

Feature wish: Having a parameter `offset`, such that the function can be used as

monotonicallyIncreasingID( offset )

and indexing *starts at `offset` instead of 0*.



  was:
With SPARK-7135 and https://github.com/apache/spark/pull/5709 
`monotonicallyIncreasingID()` allows to create an index column with unique ids. 
The indexing always starts at 0 (no offset).

Feature wish: Having a parameter `offset`, such that the function can be used as

monotonicallyIncreasingID( offset )

and indexing starts at `offset` instead of 0.




> monotonicallyIncreasingId() supports offset to start indexing at
> 
>
> Key: SPARK-10868
> URL: https://issues.apache.org/jira/browse/SPARK-10868
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Martin Senne
>
> With SPARK-7135 and https://github.com/apache/spark/pull/5709 
> `monotonicallyIncreasingID()` allows to create an index column with unique 
> ids. The indexing always starts at 0 (no offset).
> Feature wish: Having a parameter `offset`, such that the function can be used 
> as
> monotonicallyIncreasingID( offset )
> and indexing *starts at `offset` instead of 0*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10868) monotonicallyIncreasingId() supports offset to start indexing at

2015-09-29 Thread Martin Senne (JIRA)
Martin Senne created SPARK-10868:


 Summary: monotonicallyIncreasingId() supports offset to start 
indexing at
 Key: SPARK-10868
 URL: https://issues.apache.org/jira/browse/SPARK-10868
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.0
Reporter: Martin Senne


With SPARK-7135 and https://github.com/apache/spark/pull/5709 
`monotonicallyIncreasingID()` allows creating an index column with unique ids. 
The indexing always starts at 0 (no offset).

Feature wish: a parameter `offset`, such that the function can be used as

monotonicallyIncreasingID( offset )

and indexing starts at `offset` instead of 0.
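As an interim workaround, the generated ids can simply be shifted by a constant; a hedged sketch (this is not the requested API, just a stop-gap, and the column name and offset below are only examples):

{code}
import org.apache.spark.sql.functions.{lit, monotonicallyIncreasingId}

// Workaround sketch (run in spark-shell, where sqlContext is predefined):
// the ids only need to be unique, so a constant offset can be added on top
// of the generated column.
val df = sqlContext.createDataFrame(Seq(("a", 1), ("b", 2))).toDF("key", "value")
val offset = 1000000L
val indexed = df.withColumn("id", monotonicallyIncreasingId() + lit(offset))
{code}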





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10490) Consolidate the Cholesky solvers in WeightedLeastSquares and ALS

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934897#comment-14934897
 ] 

Apache Spark commented on SPARK-10490:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8936

> Consolidate the Cholesky solvers in WeightedLeastSquares and ALS
> 
>
> Key: SPARK-10490
> URL: https://issues.apache.org/jira/browse/SPARK-10490
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> There are two Cholesky solvers in WeightedLeastSquares and ALS, we should 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10490) Consolidate the Cholesky solvers in WeightedLeastSquares and ALS

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10490:


Assignee: Apache Spark

> Consolidate the Cholesky solvers in WeightedLeastSquares and ALS
> 
>
> Key: SPARK-10490
> URL: https://issues.apache.org/jira/browse/SPARK-10490
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> There are two Cholesky solvers in WeightedLeastSquares and ALS, we should 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10490) Consolidate the Cholesky solvers in WeightedLeastSquares and ALS

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10490:


Assignee: (was: Apache Spark)

> Consolidate the Cholesky solvers in WeightedLeastSquares and ALS
> 
>
> Key: SPARK-10490
> URL: https://issues.apache.org/jira/browse/SPARK-10490
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>
> There are two Cholesky solvers in WeightedLeastSquares and ALS, we should 
> merge them into one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

2015-09-29 Thread Hans van den Bogert (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934952#comment-14934952
 ] 

Hans van den Bogert commented on SPARK-10474:
-

I should note that I can reproduce this easily with Mesos fine-grained mode. 
However, with the same environment and the same query, Mesos coarse-grained mode 
does not show this error (I tried 3 times). Does anyone have an idea which 
difference between fine-grained and coarse-grained could cause this?

> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -
>
> Key: SPARK-10474
> URL: https://issues.apache.org/jira/browse/SPARK-10474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.5.1, 1.6.0
>
>
> In an aggregation case, a lost task happened with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
> at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:126)
> at 
> org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING 

[jira] [Commented] (SPARK-10736) Use 1 for all ratings if $(ratingCol) = ""

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934985#comment-14934985
 ] 

Apache Spark commented on SPARK-10736:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8937

> Use 1 for all ratings if $(ratingCol) = ""
> --
>
> Key: SPARK-10736
> URL: https://issues.apache.org/jira/browse/SPARK-10736
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> For some implicit datasets, ratings may not exist in the training data. In 
> this case, we can assume all observed pairs to be positive and treat their 
> ratings as 1. This should happen when users set ratingCol to an empty string.
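A hedged sketch of what that could look like when selecting the ratings (the helper and the column names are assumptions for illustration, not the actual ALS code):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.FloatType

// Sketch only: when ratingCol is "", treat every observed (user, item) pair
// as a positive example with rating 1.0; otherwise use the given column.
def selectRatings(df: DataFrame, userCol: String, itemCol: String,
                  ratingCol: String): DataFrame = {
  val rating = if (ratingCol.isEmpty) lit(1.0f) else col(ratingCol).cast(FloatType)
  df.select(col(userCol), col(itemCol), rating.as("rating"))
}
{code}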



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10736) Use 1 for all ratings if $(ratingCol) = ""

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10736:


Assignee: (was: Apache Spark)

> Use 1 for all ratings if $(ratingCol) = ""
> --
>
> Key: SPARK-10736
> URL: https://issues.apache.org/jira/browse/SPARK-10736
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> For some implicit datasets, ratings may not exist in the training data. In 
> this case, we can assume all observed pairs to be positive and treat their 
> ratings as 1. This should happen when users set ratingCol to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10736) Use 1 for all ratings if $(ratingCol) = ""

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10736:


Assignee: Apache Spark

> Use 1 for all ratings if $(ratingCol) = ""
> --
>
> Key: SPARK-10736
> URL: https://issues.apache.org/jira/browse/SPARK-10736
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> For some implicit datasets, ratings may not exist in the training data. In 
> this case, we can assume all observed pairs to be positive and treat their 
> ratings as 1. This should happen when users set ratingCol to an empty string.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10871) Specify number of failed executors in ApplicationMaster error message

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935278#comment-14935278
 ] 

Apache Spark commented on SPARK-10871:
--

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/8939

> Specify number of failed executors in ApplicationMaster error message
> -
>
> Key: SPARK-10871
> URL: https://issues.apache.org/jira/browse/SPARK-10871
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> I ran into 
> [this|https://github.com/apache/spark/blob/9b9fe5f7bf55257269d8febcd64e95677075dfb6/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L346-L348]
>  error message today while debugging a failed app:
> {code}
> 15/09/29 00:33:20 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 11, (reason: Max number of executor failures reached)
> 15/09/29 00:33:23 INFO util.ShutdownHookManager: Shutdown hook called
> {code}
> This app ran with dynamic allocation and I'm not sure what limit was used as 
> the "maximum allowable number of failed executors"; in any case, the error 
> message may as well specify this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10870) Criteo Display Advertising Challenge dataset

2015-09-29 Thread Peter Rudenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Rudenko updated SPARK-10870:
--
Description: 
Very useful dataset to test pipelines because of:
# "Big data" dataset - the original Kaggle competition dataset is 12 gb, but 
there's a [1tb|http://labs.criteo.com/downloads/download-terabyte-click-logs/] 
dataset of the same schema as well.
# Sparse models - categorical features have high cardinality
# Reproducible results - because it's public and many other distributed machine 
learning libraries (e.g. 
[wormhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst],
 [parameter 
server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md],
 [azure 
ml|https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-process-hive-criteo-walkthrough/#mltasks]
 etc.) have made baseline benchmarks against which we could compare.


I have some baseline results with custom models (GBDT encoders and tuned LR) 
on spark-1.4. Will make pipelines using public Spark models. The [winning 
solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] used a 
GBDT encoder (not available in Spark, but not difficult to make one from the GBT 
in MLlib) + hashing + a factorization machine (planned for spark-1.6).

  was:
Very useful dataset to test pipeline because of:
# "Big data" dataset - original Kaggle competition dataset is 12 gb, but 
there's [1tb|http://labs.criteo.com/downloads/download-terabyte-click-logs/] 
dataset of the same schema as well.
# Sparse models - categorical features has high cardinality
# Reproducible results - because it's public and many other distributed machine 
learning libraries (e.g. 
[wormwhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst],
 [parameter 
server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md]
 etc.) have made a base line benchmarks on which we could compare.


I have some base line results with custom models (GBDT encoders and tuned LR) 
on spark-1.4. Will make pipelines using public spark model. [Winning 
solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] used 
GBDT encoder (not available in spark, but not difficult to make one from GBT 
from mllib) + hashing + factorization machine (planned for spark-1.6).


> Criteo Display Advertising Challenge dataset
> 
>
> Key: SPARK-10870
> URL: https://issues.apache.org/jira/browse/SPARK-10870
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Peter Rudenko
>
> Very useful dataset to test pipelines because of:
> # "Big data" dataset - the original Kaggle competition dataset is 12 gb, but 
> there's a [1tb|http://labs.criteo.com/downloads/download-terabyte-click-logs/] 
> dataset of the same schema as well.
> # Sparse models - categorical features have high cardinality
> # Reproducible results - because it's public and many other distributed 
> machine learning libraries (e.g. 
> [wormhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst],
>  [parameter 
> server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md],
>  [azure 
> ml|https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-process-hive-criteo-walkthrough/#mltasks]
>  etc.) have made baseline benchmarks against which we could compare.
> I have some baseline results with custom models (GBDT encoders and tuned LR) 
> on spark-1.4. Will make pipelines using public Spark models. The [winning 
> solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] used a 
> GBDT encoder (not available in Spark, but not difficult to make one from the GBT 
> in MLlib) + hashing + a factorization machine (planned for spark-1.6).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10870) Criteo Display Advertising Challenge dataset

2015-09-29 Thread Peter Rudenko (JIRA)
Peter Rudenko created SPARK-10870:
-

 Summary: Criteo Display Advertising Challenge dataset
 Key: SPARK-10870
 URL: https://issues.apache.org/jira/browse/SPARK-10870
 Project: Spark
  Issue Type: Sub-task
Reporter: Peter Rudenko


Very useful dataset for testing pipelines because of:
# "Big data" dataset - the original Kaggle competition dataset is 12 GB, but 
there's a [1 TB|http://labs.criteo.com/downloads/download-terabyte-click-logs/] 
dataset with the same schema as well.
# Sparse models - categorical features have high cardinality
# Reproducible results - because it's public, and many other distributed machine 
learning libraries (e.g. 
[wormhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst],
 [parameter 
server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md]
 etc.) have published baseline benchmarks against which we could compare.


I have some baseline results with custom models (GBDT encoders and a tuned LR) 
on spark-1.4. I will build pipelines using the public Spark models. The 
[winning solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] 
used a GBDT encoder (not available in Spark, but not difficult to build from the 
GBT in mllib) + hashing + a factorization machine (planned for spark-1.6).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10871) Specify number of failed executors in ApplicationMaster error message

2015-09-29 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-10871:
-

 Summary: Specify number of failed executors in ApplicationMaster 
error message
 Key: SPARK-10871
 URL: https://issues.apache.org/jira/browse/SPARK-10871
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Ryan Williams
Priority: Minor


I ran into 
[this|https://github.com/apache/spark/blob/9b9fe5f7bf55257269d8febcd64e95677075dfb6/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L346-L348]
 error message today while debugging a failed app:

{code}
15/09/29 00:33:20 INFO yarn.ApplicationMaster: Final app status: FAILED, 
exitCode: 11, (reason: Max number of executor failures reached)
15/09/29 00:33:23 INFO util.ShutdownHookManager: Shutdown hook called
{code}

This app ran with dynamic allocation and I'm not sure what limit was used as 
the "maximum allowable number of failed executors"; in any case, the error 
message may as well specify this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10870) Criteo Display Advertising Challenge

2015-09-29 Thread Peter Rudenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Rudenko updated SPARK-10870:
--
Summary: Criteo Display Advertising Challenge  (was: Criteo Display 
Advertising Challenge dataset)

> Criteo Display Advertising Challenge
> 
>
> Key: SPARK-10870
> URL: https://issues.apache.org/jira/browse/SPARK-10870
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Peter Rudenko
>
> Very useful dataset to test pipeline because of:
> # "Big data" dataset - original Kaggle competition dataset is 12 gb, but 
> there's [1tb|http://labs.criteo.com/downloads/download-terabyte-click-logs/] 
> dataset of the same schema as well.
> # Sparse models - categorical features has high cardinality
> # Reproducible results - because it's public and many other distributed 
> machine learning libraries (e.g. 
> [wormwhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst],
>  [parameter 
> server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md],
>  [azure 
> ml|https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-process-hive-criteo-walkthrough/#mltasks]
>  etc.) have made a base line benchmarks on which we could compare.
> I have some base line results with custom models (GBDT encoders and tuned LR) 
> on spark-1.4. Will make pipelines using public spark model. [Winning 
> solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] used 
> GBDT encoder (not available in spark, but not difficult to make one from GBT 
> from mllib) + hashing + factorization machine (planned for spark-1.6).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10871) Specify number of failed executors in ApplicationMaster error message

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10871:


Assignee: Apache Spark

> Specify number of failed executors in ApplicationMaster error message
> -
>
> Key: SPARK-10871
> URL: https://issues.apache.org/jira/browse/SPARK-10871
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Assignee: Apache Spark
>Priority: Minor
>
> I ran in to 
> [this|https://github.com/apache/spark/blob/9b9fe5f7bf55257269d8febcd64e95677075dfb6/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L346-L348]
>  error message today while debugging a failed app:
> {code}
> 15/09/29 00:33:20 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 11, (reason: Max number of executor failures reached)
> 15/09/29 00:33:23 INFO util.ShutdownHookManager: Shutdown hook called
> {code}
> This app ran with dynamic allocation and I'm not sure what limit was used as 
> the "maximum allowable number of failed executors"; in any case, the error 
> message may as well specify this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10871) Specify number of failed executors in ApplicationMaster error message

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10871:


Assignee: (was: Apache Spark)

> Specify number of failed executors in ApplicationMaster error message
> -
>
> Key: SPARK-10871
> URL: https://issues.apache.org/jira/browse/SPARK-10871
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> I ran in to 
> [this|https://github.com/apache/spark/blob/9b9fe5f7bf55257269d8febcd64e95677075dfb6/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L346-L348]
>  error message today while debugging a failed app:
> {code}
> 15/09/29 00:33:20 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 11, (reason: Max number of executor failures reached)
> 15/09/29 00:33:23 INFO util.ShutdownHookManager: Shutdown hook called
> {code}
> This app ran with dynamic allocation and I'm not sure what limit was used as 
> the "maximum allowable number of failed executors"; in any case, the error 
> message may as well specify this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
That is not easy to deal with in data warehousing, where it's better to normalize 
the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, 
and the list of the different node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}
{code}
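
As a rough user-land sketch of the flattening being proposed (not the 
{{jsonValue(normalized=True)}} API itself), a schema could be walked like this; 
arrays of structs are skipped here for brevity:

{code:python}
from pyspark.sql.types import StructType

def flatten_schema(schema, prefix=""):
    """Walk a StructType and return one entry per leaf field with its full dotted path."""
    fields = []
    for f in schema.fields:
        path = f.name if not prefix else prefix + "." + f.name
        if isinstance(f.dataType, StructType):
            # Recurse into nested structs; arrays of structs are not handled here.
            fields.extend(flatten_schema(f.dataType, path))
        else:
            fields.append({"fullPathName": path,
                           "name": f.name,
                           "type": f.dataType.simpleString(),
                           "nullable": f.nullable,
                           "metadata": f.metadata})
    return fields

# flatten_schema(df.schema) would yield the 'fields' list shown above.
{code}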


  was:
today, you can get a multi-depth schema from a semi-structured dataframe. (XML, 
JSON, etc..)
Not so easy to deal in data warehousing where it's better to normalize the data.

I propose an option to add when you get the schema (normalized, default False)
Then the returned json schema will contains the normalized path for each field, 
and the list of the different node levels

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}




> Auto-normalization of semi-structured schema from a dataframe
> -
>
> Key: SPARK-10869
> URL: https://issues.apache.org/jira/browse/SPARK-10869
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Julien Genini
>  

[jira] [Created] (SPARK-10874) add Search box to History Page

2015-09-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-10874:
-

 Summary: add Search box to History Page
 Key: SPARK-10874
 URL: https://issues.apache.org/jira/browse/SPARK-10874
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.5.1
Reporter: Thomas Graves


It's hard to navigate the history server. It would be really nice to have a 
search box for finding just the applications you are interested in. It should 
search all columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935371#comment-14935371
 ] 

Thomas Graves commented on SPARK-10873:
---

Note: if there isn't a JIRA filed to fix the issue, we should use this one.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10873) can't sort columns on history page

2015-09-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-10873:
-

 Summary: can't sort columns on history page
 Key: SPARK-10873
 URL: https://issues.apache.org/jira/browse/SPARK-10873
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.1
Reporter: Thomas Graves


Starting with 1.5.1, the history server page no longer allows sorting by column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10172) History Server web UI gets messed up when sorting on any column

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935369#comment-14935369
 ] 

Thomas Graves commented on SPARK-10172:
---

Was there another JIRA filed to really fix this problem rather than just 
disabling it? I think it is really confusing for users to lose this, and it's 
hard enough to navigate the history server already.

> History Server web UI gets messed up when sorting on any column
> ---
>
> Key: SPARK-10172
> URL: https://issues.apache.org/jira/browse/SPARK-10172
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Min Shen
>Assignee: Josiah Samuel Sathiadass
>Priority: Minor
>  Labels: regression
> Fix For: 1.5.1, 1.6.0
>
> Attachments: screen-shot.png
>
>
> If the history web UI displays the "Attempt ID" column, when clicking the 
> table header to sort on any column, the entire page gets messed up.
> This seems to be a problem with the sorttable.js not able to correctly handle 
> tables with rowspan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1493) Apache RAT excludes don't work with file path (instead of file name)

2015-09-29 Thread Sandhya Sundaresan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935437#comment-14935437
 ] 

Sandhya Sundaresan commented on SPARK-1493:
---

Erik's proposal would certainly help us in the Trafodion project. We have a 
high-level directory structure on which we'd like to run RAT. We have several 
subdirectories and want to include only a subset of files in each of those 
directories. Using filename patterns without a /path would make the exclusions 
imprecise: the same file name may exist in 2 different subdirectories, and both 
would get excluded, which we don't really intend. We want to exclude specific 
files in specific directories with a regexp. Unfortunately, it isn't working.


> Apache RAT excludes don't work with file path (instead of file name)
> 
>
> Key: SPARK-1493
> URL: https://issues.apache.org/jira/browse/SPARK-1493
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Patrick Wendell
>  Labels: starter
>
> Right now the way we do RAT checks, it doesn't work if you try to exclude:
> /path/to/file.ext
> you have to just exclude
> file.ext



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
That is not easy to deal with in data warehousing, where it's better to normalize 
the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, 
and the list of the different node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{code}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}
{code}


  was:
today, you can get a multi-depth schema from a semi-structured dataframe. (XML, 
JSON, etc..)
Not so easy to deal in data warehousing where it's better to normalize the data.

I propose an option to add when you get the schema (normalized, default False)
Then the returned json schema will contains the normalized path for each field, 
and the list of the different node levels

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{code:json}
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}
{code}



> Auto-normalization of semi-structured schema from a dataframe
> -
>
> Key: SPARK-10869
> URL: https://issues.apache.org/jira/browse/SPARK-10869
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: 

[jira] [Closed] (SPARK-6508) error handling issue running python in yarn cluster mode

2015-09-29 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-6508.

Resolution: Duplicate

> error handling issue running python in yarn cluster mode 
> -
>
> Key: SPARK-6508
> URL: https://issues.apache.org/jira/browse/SPARK-6508
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Thomas Graves
>
> I was running python in yarn cluster mode and didn't have the SPARK_HOME 
> envirnoment variables set.  The client reported a failure of: 
> ava.io.FileNotFoundException: File does not exist: 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/tgraves/.sparkStaging/application_1425530846697_59578/pi.py
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1201)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:1193)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>
> But when you look at the application master log:
> Log Contents:
> Traceback (most recent call last):
>   File "pi.py", line 29, in 
> sc = SparkContext(appName="PythonPi")
>   File 
> "/grid/11/tmp/yarn-local/usercache/tgraves/filecache/37/spark-assembly-1.3.0.0-hadoop2.6.0.6.1502061521.jar/pyspark/context.py",
>  line 108, in __init__
>   File 
> "/grid/11/tmp/yarn-local/usercache/tgraves/filecache/37/spark-assembly-1.3.0.0-hadoop2.6.0.6.1502061521.jar/pyspark/context.py",
>  line 222, in _ensure_initialized
>   File 
> "/grid/11/tmp/yarn-local/usercache/tgraves/filecache/37/spark-assembly-1.3.0.0-hadoop2.6.0.6.1502061521.jar/pyspark/java_gateway.py",
>  line 32, in launch_gateway
>   File "/usr/lib64/python2.6/UserDict.py", line 22, in __getitem__
> raise KeyError(key)
> KeyError: 'SPARK_HOME'
> But the application master thought it succeeded and removed the staging 
> directory when it shouldn't have:
> 15/03/24 14:50:07 INFO yarn.ApplicationMaster: Waiting for spark context 
> initialization ... 
> 15/03/24 14:50:08 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 15/03/24 14:50:08 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
> final status was reported.)
> 15/03/24 14:50:08 INFO yarn.ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1425530846697_59578



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935370#comment-14935370
 ] 

Thomas Graves commented on SPARK-10873:
---

Evidently this was explicitly disabled by SPARK-10172. This is very annoying and 
confusing.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2015-09-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935359#comment-14935359
 ] 

Thomas Graves commented on SPARK-6951:
--

I'm still seeing really slow history server startup in 1.5.1... SPARK-5522 may 
have helped, although I don't see it showing the list of application names in 
the UI before the data from those logs is loaded, so I'm not sure that PR did 
what its description says.

It would be really nice if it could get the metadata without having to parse the 
entire log.
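
For what it's worth, a rough sketch of the kind of cheap metadata read I have in 
mind, assuming the 1.5-era event-log format (newline-delimited JSON with the 
application-start event near the top); the exact event and field names here are 
from memory and may be off:

{code:python}
import json

def read_app_metadata(path):
    """Read only until the application-start event instead of replaying the whole log."""
    with open(path) as log:
        for line in log:
            event = json.loads(line)
            if event.get("Event") == "SparkListenerApplicationStart":
                return {"appName": event.get("App Name"),
                        "appId": event.get("App ID"),
                        "startTime": event.get("Timestamp"),
                        "user": event.get("User")}
    return None
{code}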

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2015-09-29 Thread Dmytro Bielievtsov (JIRA)
Dmytro Bielievtsov created SPARK-10872:
--

 Summary: Derby error (XSDB6) when creating new HiveContext after 
restarting SparkContext
 Key: SPARK-10872
 URL: https://issues.apache.org/jira/browse/SPARK-10872
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.0, 1.4.1, 1.4.0
Reporter: Dmytro Bielievtsov


Starting from spark 1.3.1, the following code fails with "XSDB6: Another 
instance of Derby may have already booted the database ~/metastore_db":

{code:python}
from pyspark import SparkContext, HiveContext
sc = SparkContext("local[*]", "app1")
sql = HiveContext(sc)
sql.createDataFrame([[1]]).collect()
sc.stop()
sc = SparkContext("local[*]", "app2")
sql = HiveContext(sc)
sql.createDataFrame([[1]]).collect()  # Py4J error
{code}

This is related to [#SPARK-9539]; I intend to restart the Spark context several 
times for isolated jobs, to prevent cache clutter and GC errors.

Here's a larger part of the full error trace:
{noformat}
Failed to start database 'metastore_db' with class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the 
next exception for details.
org.datanucleus.exceptions.NucleusDataStoreException: Failed to start database 
'metastore_db' with class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the 
next exception for details.
at 
org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
at 
org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at 
org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at 
org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at 
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at 
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at 

[jira] [Updated] (SPARK-10869) Auto-normalization of semi-structured schema from a dataframe

2015-09-29 Thread Julien Genini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Genini updated SPARK-10869:
--
Description: 
Today, you can get a multi-depth schema from a semi-structured dataframe (XML, 
JSON, etc.).
That is not easy to deal with in data warehousing, where it's better to normalize 
the data.

I propose adding an option when you get the schema (normalized, default False).
The returned JSON schema will then contain the normalized path for each field, 
and the list of the different node levels:

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'fullPathName': 'SiteXML.BusinessDate',
 'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Group',
 'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.Id_Site',
 'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle',
 'metadata': {},
 'name': 'libelle',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.Site_List.Site.libelle_Group',
 'metadata': {},
 'name': 'libelle_Group',
 'nullable': True,
 'type': 'string'},
{'fullPathName': 'SiteXML.TimeStamp',
 'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'type': 'string'}],
 'nodes': [{'fieldsFullPathName': ['SiteXML.BusinessDate',
   'SiteXML.TimeStamp'],
'fullPathName': 'SiteXML',
'nbFields': 2},
   {'fieldsFullPathName': ['SiteXML.Site_List.Site.Id_Group',
   'SiteXML.Site_List.Site.Id_Site',
   'SiteXML.Site_List.Site.libelle',
   'SiteXML.Site_List.Site.libelle_Group'],
'fullPathName': 'SiteXML.Site_List.Site',
'nbFields': 4}]}



  was:
today, you can get a multi-depth schema from a semi-structured dataframe. (XML, 
JSON, etc..)
Not so easy to deal in data warehousing where it's better to normalize the data.

I propose an option to add when you get the schema (normalized, default False)
Then the returned json schema will contains the normalized path for each field, 
and the list of the different node levels

df = sqlContext.read.json(jsonPath)
jsonLinearSchema = df.schema.jsonValue(normalized=True)

>>
{'fields': [{'metadata': {},
 'name': 'BusinessDate',
 'nullable': True,
 'pathName': 'SiteXML.BusinessDate',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Group',
 'type': 'string'},
{'metadata': {},
 'name': 'Id_Site',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.Id_Site',
 'type': 'string'},
{'metadata': {},
 'name': 'label',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label',
 'type': 'string'},
{'metadata': {},
 'name': 'label_group',
 'nullable': True,
 'pathName': 'SiteXML.Site_List.Site.label_group',
 'type': 'string'},
{'metadata': {},
 'name': 'TimeStamp',
 'nullable': True,
 'pathName': 'SiteXML.TimeStamp',
 'type': 'string'}],
 'nodes': [{'name': '', 'nbFields': 3},
   {'name': 'SiteXML', 'nbFields': 1},
   {'name': 'SiteXML.Site_List', 'nbFields': 0},
   {'name': 'SiteXML.Site_List.Site', 'nbFields': 4}]}





> Auto-normalization of semi-structured schema from a dataframe
> -
>
> Key: SPARK-10869
> URL: https://issues.apache.org/jira/browse/SPARK-10869
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Julien Genini
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> today, you can get a multi-depth schema from a semi-structured dataframe. 
> (XML, JSON, etc..)
> Not so easy to deal in data warehousing where it's better to normalize the 
> data.
> I propose an option to add when you get the schema (normalized, default False)
> Then the returned 

[jira] [Commented] (SPARK-8734) Expose all Mesos DockerInfo options to Spark

2015-09-29 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935352#comment-14935352
 ] 

Alan Braithwaite commented on SPARK-8734:
-

I like this method and it's consistent with most other ways of handling lists 
in property files.

> Expose all Mesos DockerInfo options to Spark
> 
>
> Key: SPARK-8734
> URL: https://issues.apache.org/jira/browse/SPARK-8734
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Chris Heller
>Priority: Minor
> Attachments: network.diff
>
>
> SPARK-2691 only exposed a few options from the DockerInfo message. It would 
> be reasonable to expose them all, especially given one can now specify 
> arbitrary parameters to docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10877) Assertions fail straightforward DataFrame job

2015-09-29 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-10877:
--

 Summary: Assertions fail straightforward DataFrame job
 Key: SPARK-10877
 URL: https://issues.apache.org/jira/browse/SPARK-10877
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Matt Cheah


I have some code that I'm running in a unit test suite, and it is failing with 
an assertion error.

I have translated the failing JUnit test into a Scala script that I will attach 
to the ticket. The assertion error is the following:

{code}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
lengthInBytes must be a multiple of 8 (word-aligned)
at 
org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
at 
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
at 
org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
at 
org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{code}

However, it turns out that this code actually works normally and computes the 
correct result if assertions are turned off.

I traced the code and found that when hashUnsafeWords was called, it was given 
a byte-length of 12, which clearly is not a multiple of 8. However, the job 
seems to compute correctly regardless of this fact. Of course, I can’t just 
disable assertions for my unit test though.

A few things we need to understand:

1. Why is the lengthInBytes of size 12?
2. Is it actually a problem that the byte length is not word-aligned? If so, 
how should we fix the byte length? If it's not a problem, why is the assertion 
flagging a false negative?
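
For reference, a small illustration of what the assertion checks and of the 
usual padding fix (my own sketch, not Spark's internal code):

{code:python}
def is_word_aligned(length_in_bytes):
    # A byte length is "word-aligned" when it is a multiple of 8 (one 64-bit word).
    return length_in_bytes % 8 == 0

def round_up_to_word(length_in_bytes):
    # Round up to the next multiple of 8, e.g. 12 -> 16.
    return (length_in_bytes + 7) & ~7

assert not is_word_aligned(12)
assert round_up_to_word(12) == 16
{code}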



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-09-29 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-10877:
---
Summary: Assertions fail straightforward DataFrame job due to word 
alignment  (was: Assertions fail straightforward DataFrame job)

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-09-29 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-10877:
---
Attachment: SparkFilterByKeyTest.scala

I've attached the Scala script that manifests the problem on my machine.

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code that I’m running in a unit test suite, but the code I’m 
> running is failing with an assertion error.
> I have translated the JUnit test that was failing, to a Scala script that I 
> will attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false negative?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6919) Add .asDict method to StatCounter

2015-09-29 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-6919.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 5516
[https://github.com/apache/spark/pull/5516]

> Add .asDict method to StatCounter
> -
>
> Key: SPARK-6919
> URL: https://issues.apache.org/jira/browse/SPARK-6919
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Erik Shilts
>Priority: Minor
> Fix For: 1.6.0
>
>
> Adds an `.asDict` method to the StatCounter object instance in PySpark. This 
> will make it easier to parse a call to `.stats()`.
> For now this affects only PySpark, but if desired I can add an `.asMap` 
> method to the Scala version as well.
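
A tiny usage sketch, assuming the {{asDict}} method added by that pull request 
on the {{StatCounter}} returned from {{RDD.stats()}}:

{code:python}
stats = sc.parallelize([1.0, 2.0, 3.0, 4.0]).stats()  # sc: an existing SparkContext
print(stats.asDict())  # e.g. {'count': 4, 'mean': 2.5, 'sum': 10.0, ...}
{code}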



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10875:


Assignee: (was: Apache Spark)

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Priority: Minor
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}
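
Until the implementation is changed, a user-side workaround sketch is simply to 
force exact symmetry by averaging the estimate with its transpose (shown in 
numpy terms for illustration; the suggested fix itself belongs in the Scala 
{{computeCovariance}}):

{code:python}
import numpy as np

def symmetrize(cov):
    # Average a numerically almost-symmetric matrix with its transpose;
    # the result is exactly symmetric, and unchanged if cov already was.
    a = np.asarray(cov, dtype=float)
    return (a + a.T) / 2.0
{code}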



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7245) Spearman correlation for DataFrames

2015-09-29 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936059#comment-14936059
 ] 

Narine Kokhlikyan commented on SPARK-7245:
--

Thanks for your comment ~Xiangrui Meng
Let me know if you need any help.

Thanks,
Narine 

> Spearman correlation for DataFrames
> ---
>
> Key: SPARK-7245
> URL: https://issues.apache.org/jira/browse/SPARK-7245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>
> Spearman correlation is harder than Pearson to compute.
> ~~~
> df.stat.corr(col1, col2, method="spearman"): Double
> ~~~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9835) Iteratively reweighted least squares solver for GLMs

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935827#comment-14935827
 ] 

Apache Spark commented on SPARK-9835:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7721

> Iteratively reweighted least squares solver for GLMs
> 
>
> Key: SPARK-9835
> URL: https://issues.apache.org/jira/browse/SPARK-9835
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-9834, we can implement iteratively reweighted least squares 
> (IRLS) solver for GLMs with other families and link functions. It could 
> provide R-like summary statistics after training, but the number of features 
> cannot be very large, e.g. more than 4096.
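
For readers unfamiliar with IRLS, a minimal self-contained sketch for a binomial 
GLM with the logit link (my own illustration of the algorithm; the actual solver 
builds on Spark's weighted least squares and supports other families and links):

{code:python}
import numpy as np

def irls_logistic(X, y, iters=25, tol=1e-8):
    """Iteratively reweighted least squares for logistic regression (logit link)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(iters):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))            # mean function for the logit link
        w = np.maximum(mu * (1.0 - mu), 1e-12)     # IRLS working weights
        z = eta + (y - mu) / w                     # working response
        wx = X * w[:, None]                        # row-scaled design matrix, i.e. W X
        beta_new = np.linalg.solve(X.T @ wx, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
{code}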



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9835) Iteratively reweighted least squares solver for GLMs

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9835:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

> Iteratively reweighted least squares solver for GLMs
> 
>
> Key: SPARK-9835
> URL: https://issues.apache.org/jira/browse/SPARK-9835
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-9834, we can implement iteratively reweighted least squares 
> (IRLS) solver for GLMs with other families and link functions. It could 
> provide R-like summary statistics after training, but the number of features 
> cannot be very large, e.g. more than 4096.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9835) Iteratively reweighted least squares solver for GLMs

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9835:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

> Iteratively reweighted least squares solver for GLMs
> 
>
> Key: SPARK-9835
> URL: https://issues.apache.org/jira/browse/SPARK-9835
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> After SPARK-9834, we can implement iteratively reweighted least squares 
> (IRLS) solver for GLMs with other families and link functions. It could 
> provide R-like summary statistics after training, but the number of features 
> cannot be very large, e.g. more than 4096.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10671) Calling a UDF with insufficient number of input arguments should throw an analysis error

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935857#comment-14935857
 ] 

Apache Spark commented on SPARK-10671:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8941

> Calling a UDF with insufficient number of input arguments should throw an 
> analysis error
> 
>
> Key: SPARK-10671
> URL: https://issues.apache.org/jira/browse/SPARK-10671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> {code}
> import org.apache.spark.sql.functions._
> Seq((1,2)).toDF("a", "b").select(callUDF("percentile", $"a"))
> {code}
> This should throws an Analysis Exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10058) Flaky test: HeartbeatReceiverSuite: normal heartbeat

2015-09-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10058:
--
Assignee: Shixiong Zhu  (was: Andrew Or)

> Flaky test: HeartbeatReceiverSuite: normal heartbeat
> 
>
> Key: SPARK-10058
> URL: https://issues.apache.org/jira/browse/SPARK-10058
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Reporter: Davies Liu
>Assignee: Shixiong Zhu
>Priority: Critical
>  Labels: flaky-test
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/
> {code}
> Error Message
> 3 did not equal 2
> Stacktrace
> sbt.ForkMain$ForkError: 3 did not equal 2
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply$mcV$sp(HeartbeatReceiverSuite.scala:104)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
>   at 
> org.apache.spark.HeartbeatReceiverSuite$$anonfun$2.apply(HeartbeatReceiverSuite.scala:97)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.runTest(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.org$scalatest$BeforeAndAfterAll$$super$run(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> org.apache.spark.HeartbeatReceiverSuite.run(HeartbeatReceiverSuite.scala:41)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> 

[jira] [Updated] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

2015-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10782:
--
Assignee: Asoka Diggs

> Duplicate examples for drop_duplicates and DropDuplicates
> -
>
> Key: SPARK-10782
> URL: https://issues.apache.org/jira/browse/SPARK-10782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Asoka Diggs
>Assignee: Asoka Diggs
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In documentation for pyspark.sql, the source code examples for DropDuplicates 
> and drop_duplicates are identical with each other.  It appears that the 
> example for DropDuplicates was copy/pasted for drop_duplicates and not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10875:


Assignee: Apache Spark

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Assignee: Apache Spark
>Priority: Minor
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}
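
A possible user-level workaround sketch, continuing the quoted example above (this is not the fix suggested in the description; it assumes {{cov}} is the dense {{mllib.linalg}} matrix returned by {{computeCovariance()}} and {{mean}} is the vector from the example):

{code}
import org.apache.spark.mllib.linalg.DenseMatrix
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian

// Average the (i, j) and (j, i) entries so the result is exactly symmetric.
val n = cov.numRows
val values = Array.tabulate(n * n) { k =>
  val i = k % n; val j = k / n        // column-major layout of DenseMatrix
  0.5 * (cov(i, j) + cov(j, i))
}
val symCov = new DenseMatrix(n, n, values)
val dist = new MultivariateGaussian(mean, symCov)  // symmetric by construction
{code}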



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10782) Duplicate examples for drop_duplicates and DropDuplicates

2015-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10782.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8930
[https://github.com/apache/spark/pull/8930]

> Duplicate examples for drop_duplicates and DropDuplicates
> -
>
> Key: SPARK-10782
> URL: https://issues.apache.org/jira/browse/SPARK-10782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Asoka Diggs
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In documentation for pyspark.sql, the source code examples for DropDuplicates 
> and drop_duplicates are identical with each other.  It appears that the 
> example for DropDuplicates was copy/pasted for drop_duplicates and not edited.
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.dropDuplicates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935837#comment-14935837
 ] 

Apache Spark commented on SPARK-10875:
--

User 'pnpritchard' has created a pull request for this issue:
https://github.com/apache/spark/pull/8940

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Priority: Minor
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936317#comment-14936317
 ] 

Saisai Shao edited comment on SPARK-10858 at 9/30/15 5:49 AM:
--

So basically I think the question is: if we specify the scheme, do we need to 
guarantee that reserved characters are escaped ourselves? If we rely on Spark to 
handle this, I think we need to fix this issue.

Another interesting thing is that I'm not sure why your result is different from 
mine.


was (Author: jerryshao):
So basically I think the problem is do we need to treat this name "xx#xx" as a 
legal name, if so we need to fix this behavior.

Another interesting thing is that not sure why your result is different from 
mine.

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-09-29 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936219#comment-14936219
 ] 

Seth Hendrickson commented on SPARK-10788:
--

[~josephkb] I'm interested in working on this issue, but I'm not sure I see the 
problem. Looking through ML RandomForest implementation I found that 
{{numBins}} for unordered features is {{def numUnorderedBins(arity: Int): Int = 
2 * ((1 << arity - 1) - 1)}} and that {{numSplits}} is just {{numBins / 2}}. 

In the 3 category example: {{numBins = 2 * (( 1 << (3 - 1)) - 1) = 6}} and so 
the number of splits considered is {{numSplits = 6 / 2 = 3}}. This seems to be 
the same as in the MLlib implementation. Perhaps I am overlooking something. 
I'd appreciate any feedback...
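
For reference, the arithmetic above as a small runnable snippet (plain Scala; the helper simply mirrors the definition quoted above and is not taken from the Spark source):

{code}
// "-" binds tighter than "<<" in Scala, so this is 2 * ((1 << (arity - 1)) - 1).
def numUnorderedBins(arity: Int): Int = 2 * ((1 << arity - 1) - 1)

val numBins = numUnorderedBins(3)  // 2 * ((1 << 2) - 1) = 6
val numSplits = numBins / 2        // 3
{code}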

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Decision trees in spark.ml (RandomForest.scala) effectively creates a second 
> copy of each split. E.g., if there are 3 categories A, B, C, then we should 
> consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we also consider the 3 flipped splits:
> * B,C vs. A
> * C vs. A, B
> * B vs. A, C
> This means we communicate twice as much data as needed for these features.
> We should eliminate these duplicate splits within the spark.ml implementation 
> since the spark.mllib implementation will be removed before long (and will 
> instead call into spark.ml).
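
A minimal sketch of one way to enumerate each unordered split exactly once (illustrative only, not the spark.ml implementation; categories are assumed to be indexed {{0 until arity}}):

{code}
// Fixing the last category on the "right" side of every split avoids generating
// the flipped copy of each split, giving 2^(arity - 1) - 1 splits instead of twice that.
def unorderedSplits(arity: Int): Seq[Set[Int]] =
  (1 until (1 << (arity - 1))).map { bits =>
    (0 until arity - 1).filter(i => ((bits >> i) & 1) == 1).toSet
  }

// arity = 3 (categories A=0, B=1, C=2) gives Set(0), Set(1), Set(0, 1),
// i.e. A | BC, B | AC, AB | C -- three splits, no flipped duplicates.
{code}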



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10878) Race condition when resolving Maven coordinates via Ivy

2015-09-29 Thread Ryan Williams (JIRA)
Ryan Williams created SPARK-10878:
-

 Summary: Race condition when resolving Maven coordinates via Ivy
 Key: SPARK-10878
 URL: https://issues.apache.org/jira/browse/SPARK-10878
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Ryan Williams
Priority: Minor


I've recently been shell-scripting the creation of many concurrent 
Spark-on-YARN apps and observing that a fraction of them fail with what I'm 
guessing is a race condition in their Maven-coordinate resolution.

For example, I might spawn an app for each path in file {{paths}} with the 
following shell script:

{code}
cat paths | parallel "$SPARK_HOME/bin/spark-submit foo.jar {}"
{code}

When doing this, I observe some fraction of the spawned jobs to fail with 
errors like:

{code}
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
Exception in thread "main" java.lang.RuntimeException: problem during retrieve 
of org.apache.spark#spark-submit-parent: java.text.ParseException: failed to 
parse report: 
/hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
 Premature end of file.
at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1006)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.text.ParseException: failed to parse report: 
/hpc/users/willir31/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml:
 Premature end of file.
at 
org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
at 
org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
... 7 more
Caused by: org.xml.sax.SAXParseException; Premature end of file.
at 
org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
{code}

The more apps I try to launch simultaneously, the greater the fraction of them 
that fails with this or similar errors; a batch of ~10 will usually work fine, a 
batch of 15 will see a few failures, and a batch of ~60 will have dozens of 
failures.

[This gist shows 11 recent failures I 
observed|https://gist.github.com/ryan-williams/648bff70e518de0c7c84].
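
One possible mitigation sketch while this is open (not a fix; it assumes the {{spark.jars.ivy}} setting, which points Ivy at a different user directory, is honored by this version):

{code}
# Give every concurrent submission its own Ivy directory so the resolution reports don't collide.
cat paths | parallel "$SPARK_HOME/bin/spark-submit --conf spark.jars.ivy=/tmp/ivy-{#} foo.jar {}"
{code}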



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10800) Flaky test: org.apache.spark.deploy.StandaloneDynamicAllocationSuite

2015-09-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10800.
---
Resolution: Duplicate

> Flaky test: org.apache.spark.deploy.StandaloneDynamicAllocationSuite
> 
>
> Key: SPARK-10800
> URL: https://issues.apache.org/jira/browse/SPARK-10800
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Shixiong Zhu
>  Labels: flaky-test
>
> Saw several failures on master:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3622/HADOOP_PROFILE=hadoop-2.4,label=spark-test/testReport/junit/org.apache.spark.deploy/
> {code}
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite.dynamic allocation 
> default behavior
> Failing for the past 1 build (Since Failed#3622 )
> Took 0.12 sec.
> add description
> Error Message
> 1 did not equal 2
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: 1 did not equal 2
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:500)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:466)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply$mcV$sp(StandaloneDynamicAllocationSuite.scala:78)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:73)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite$$anonfun$1.apply(StandaloneDynamicAllocationSuite.scala:73)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:42)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneDynamicAllocationSuite.scala:33)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite.runTest(StandaloneDynamicAllocationSuite.scala:33)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
>   at org.scalatest.Suite$class.run(Suite.scala:1424)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite.org$scalatest$BeforeAndAfterAll$$super$run(StandaloneDynamicAllocationSuite.scala:33)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
>   at 
> org.apache.spark.deploy.StandaloneDynamicAllocationSuite.run(StandaloneDynamicAllocationSuite.scala:33)
>   at 

[jira] [Commented] (SPARK-10515) When killing executor, the pending replacement executors will be lost

2015-09-29 Thread KaiXinXIaoLei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936306#comment-14936306
 ] 

KaiXinXIaoLei commented on SPARK-10515:
---

If the heartbeat receiver kills executors (and new ones are not yet registered to 
replace them), and another executor hits its idle timeout in the meantime, the 
total number of executors requested by the driver is reduced, so new executors 
are not asked to replace the killed ones.
For example, executorsPendingToRemove=Set(1), and executor 2 hits its idle timeout 
before a new executor is asked to replace executor 1. The driver then kills executor 
2 and sends RequestExecutors to the AM. But executorsPendingToRemove=Set(1,2), 
so the AM doesn't allocate an executor to replace executor 1.

> When killing executor, the pending replacement executors will be lost
> -
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming

2015-09-29 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936293#comment-14936293
 ] 

Russell Alexander Spitzer commented on SPARK-9472:
--

Any thoughts on back-porting this to earlier release lines? We just hit this 
trying to pass Hadoop variables via "spark.hadoop" and finding that they do not 
register with the StreamingContext created via getOrCreate.
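
For reference, a minimal sketch of the pattern being described (the config key, checkpoint path, and batch interval are illustrative only):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")

def createContext(): StreamingContext = new StreamingContext(conf, Seconds(10))

// The behavior we hit: spark.hadoop.* settings passed this way are not picked up
// by the context returned from getOrCreate on the affected versions.
val ssc = StreamingContext.getOrCreate("/tmp/checkpoint", createContext _)
{code}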

> Consistent hadoop config for streaming
> --
>
> Key: SPARK-9472
> URL: https://issues.apache.org/jira/browse/SPARK-9472
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936317#comment-14936317
 ] 

Saisai Shao commented on SPARK-10858:
-

So basically I think the question is whether we need to treat the name "xx#xx" as a 
legal name; if so, we need to fix this behavior.

Another interesting thing is that I'm not sure why your result is different from 
mine.

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10030) Managed memory leak detected when cache table

2015-09-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-10030:
-
Comment: was deleted

(was: Such use case issue (maybe rather question) should go to user mailing 
list.

http://spark.apache.org/community.html)

> Managed memory leak detected when cache table
> -
>
> Key: SPARK-10030
> URL: https://issues.apache.org/jira/browse/SPARK-10030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: wangwei
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.5.1
>
>
> I tested the latest spark-1.5.0 in local, standalone, and yarn modes and followed 
> the steps below; the following errors occurred.
> 1. create table cache_test(id int,  name string) stored as textfile ;
> 2. load data local inpath 
> 'SparkSource/sql/hive/src/test/resources/data/files/kv1.txt' into table 
> cache_test;
> 3. cache table test as select * from cache_test distribute by id;
> configuration:
> spark.driver.memory5g
> spark.executor.memory   28g
> spark.cores.max  21
> {code}
> 15/08/16 17:14:47 ERROR Executor: Managed memory leak detected; size = 
> 67108864 bytes, TID = 434
> 15/08/16 17:14:47 ERROR Executor: Exception in task 23.0 in stage 9.0 (TID 
> 434)
> java.util.NoSuchElementException: key not found: val_54
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at 
> org.apache.spark.sql.columnar.compression.DictionaryEncoding$Encoder.compress(compressionSchemes.scala:258)
>   at 
> org.apache.spark.sql.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:110)
>   at 
> org.apache.spark.sql.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:87)
>   at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
>   at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1$$anonfun$next$2.apply(InMemoryColumnarTableScan.scala:153)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>   at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:153)
>   at 
> org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(InMemoryColumnarTableScan.scala:120)
>   at 
> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
>   at 
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:46)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936313#comment-14936313
 ] 

Saisai Shao commented on SPARK-10858:
-

Hi [~tgraves], I tested again on Mac and Linux (CentOS); it seems the behavior 
is different.

In Mac, 

if we use {{--jars my.jar#renamed.jar}}

this file path will be resolved to URI 
{{file:/Users/sshao/projects/apache-spark/my.jar%23renamed.jar}}

if we use {{--jars 
file:///Users/sshao/projects/apache-spark/my.jar#renamed.jar}}

this file path will be resolved to URI 
{{file:/Users/sshao/projects/apache-spark/my.jar#renamed.jar}}

This is done by Utils#resolveURI

{code}
  def resolveURI(path: String): URI = {
try {
  val uri = new URI(path)
  if (uri.getScheme() != null) {
return uri
  }
} catch {
  case e: URISyntaxException =>
}
new File(path).getAbsoluteFile().toURI()
  }
{code}

When the scheme is not specified, this code transforms the file path into a 
URI; note that "#" will be translated into "%23" by this `toURI` call.

In Centos:

both 

{{--jars my.jar#renamed.jar}} 

and 

{{--jars file:///Users/sshao/projects/apache-spark/my.jar#renamed.jar}} 

will be resolved to 
{{file:/Users/sshao/projects/apache-spark/my.jar#renamed.jar}} through 
Utils#resolveURI, obviously "#" is not escaped.

So in my test, both these two ways of using --jars are failed in Centos.

After digging into the Hadoop code RawLocalFileSystem#pathToFile:

{code}
  public File pathToFile(Path path) {
checkPath(path);
if (!path.isAbsolute()) {
  path = new Path(getWorkingDirectory(), path);
}
return new File(path.toUri().getPath());
  }
{code}

Here, using `URI.getPath` to get the file path leads to different behavior if we 
do not escape "#" to "%23": the part after "#" is treated as a fragment, not as 
part of the path. So on Mac the form without a scheme succeeds, whereas on CentOS 
both forms fail.

But if we instead use 

{{--jars my.jar%23renamed.jar}} 

or 

{{--jars file:///path/to/my.jar%23renamed.jar}},

it succeeds on CentOS.
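
To make the difference concrete, a small standalone sketch (plain Scala, no Spark or Hadoop needed; the paths are illustrative):

{code}
import java.io.File
import java.net.URI

// No scheme: File#toURI percent-encodes the "#", so it stays part of the path.
val noScheme = new File("/home/foo/my.jar#renamed.jar").toURI
println(noScheme)           // file:/home/foo/my.jar%23renamed.jar
println(noScheme.getPath)   // /home/foo/my.jar#renamed.jar

// Scheme given: the single-string URI constructor treats "#..." as a fragment.
val withScheme = new URI("file:///home/foo/my.jar#renamed.jar")
println(withScheme.getPath)      // /home/foo/my.jar
println(withScheme.getFragment)  // renamed.jar
{code}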













> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-09-29 Thread Sen Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936268#comment-14936268
 ] 

Sen Fang commented on SPARK-:
-

Another idea is to do something similar to the F# TypeProvider approach: 
http://fsharp.github.io/FSharp.Data/
I haven't looked into this extensively just yet, but as far as I understand it 
uses compile-time macros to generate classes based on data sources. In that 
sense, it is slightly similar to protobuf, where you generate a Java class from a 
schema definition. This makes the DataFrame type-safe at the very start of the 
pipeline. With a bit of IDE plugin support, you would even be able to get 
autocompletion and type checking as you write code, which would be very nice. I'm 
not sure whether it will be scalable to propagate this type information downstream 
(into aggregated or transformed DataFrames), though. As I understand it, macros 
and type providers in Scala provide similar capabilities.

> RDD-like API on top of Catalyst/DataFrame
> -
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> As a Spark user, I want an API that sits somewhere in the middle of the 
> spectrum so I can write most of my applications with that API, and yet it can 
> be optimized well by Spark to achieve performance and stability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10879) spark on yarn support priority option

2015-09-29 Thread Yun Zhao (JIRA)
Yun Zhao created SPARK-10879:


 Summary: spark on yarn support priority option
 Key: SPARK-10879
 URL: https://issues.apache.org/jira/browse/SPARK-10879
 Project: Spark
  Issue Type: Improvement
Reporter: Yun Zhao


Add a YARN-only option to spark-submit: *--priority PRIORITY*. The priority of 
your YARN application (default: 0).

Add a property: *spark.yarn.priority*
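
A hypothetical usage sketch, assuming the flag and property land exactly as proposed (class and jar names are placeholders; nothing here exists in a released version yet):

{code}
# via the proposed command-line option
spark-submit --master yarn-cluster --priority 5 --class com.example.Main app.jar

# or via the proposed configuration property
spark-submit --master yarn-cluster --conf spark.yarn.priority=5 --class com.example.Main app.jar
{code}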



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10515) When killing executor, the pending replacement executors will be lost

2015-09-29 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-10515:
--
Target Version/s: 1.6.0
   Fix Version/s: 1.6.0

> When killing executor, the pending replacement executors will be lost
> -
>
> Key: SPARK-10515
> URL: https://issues.apache.org/jira/browse/SPARK-10515
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-10858:

Comment: was deleted

(was: The error I got in the failed case is the same as you mentioned above, 
I'm running on Mac OS with Hadoop 2.6.0.)

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10473) EventLog will loss message in the long-running security application

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10473:


Assignee: Apache Spark

> EventLog will loss message in the long-running security application
> ---
>
> Key: SPARK-10473
> URL: https://issues.apache.org/jira/browse/SPARK-10473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.0, 1.6.0
>Reporter: SaintBacchus
>Assignee: Apache Spark
>
> In the implementation of *EventLoggingListener*, there is only one 
> OutputStream writing event messages to HDFS.
> But once the token of the *DFSClient* behind that OutputStream expires, the 
> "DFSClient" no longer has the right to write, and all subsequent messages are lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10473) EventLog will loss message in the long-running security application

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10473:


Assignee: (was: Apache Spark)

> EventLog will loss message in the long-running security application
> ---
>
> Key: SPARK-10473
> URL: https://issues.apache.org/jira/browse/SPARK-10473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.0, 1.6.0
>Reporter: SaintBacchus
>
> In the implementation of *EventLoggingListener*, there is only one 
> OutputStream writing event messages to HDFS.
> But once the token of the *DFSClient* behind that OutputStream expires, the 
> "DFSClient" no longer has the right to write, and all subsequent messages are lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10473) EventLog will loss message in the long-running security application

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936270#comment-14936270
 ] 

Apache Spark commented on SPARK-10473:
--

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/8942

> EventLog will loss message in the long-running security application
> ---
>
> Key: SPARK-10473
> URL: https://issues.apache.org/jira/browse/SPARK-10473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.0, 1.6.0
>Reporter: SaintBacchus
>
> In the implementation of *EventLoggingListener*, there is only one 
> OutputStream writing event messages to HDFS.
> But once the token of the *DFSClient* behind that OutputStream expires, the 
> "DFSClient" no longer has the right to write, and all subsequent messages are lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10879) spark on yarn support priority option

2015-09-29 Thread Yun Zhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Zhao updated SPARK-10879:
-
Component/s: YARN
 Spark Submit

> spark on yarn support priority option
> -
>
> Key: SPARK-10879
> URL: https://issues.apache.org/jira/browse/SPARK-10879
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit, YARN
>Reporter: Yun Zhao
>
> Add a YARN-only option to spark-submit: *--priority PRIORITY*. The priority 
> of your YARN application (default: 0).
> Add a property: *spark.yarn.priority*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936313#comment-14936313
 ] 

Saisai Shao edited comment on SPARK-10858 at 9/30/15 3:50 AM:
--

Hi [~tgraves], I tested again on Mac and Linux (CentOS).
 

if we use {{--jars my.jar#renamed.jar}}

this file path will be resolved to URI 
{{file:/Users/sshao/projects/apache-spark/my.jar%23renamed.jar}}

if we use {{--jars 
file:///Users/sshao/projects/apache-spark/my.jar#renamed.jar}}

this file path will be resolved to URI 
{{file:/Users/sshao/projects/apache-spark/my.jar#renamed.jar}}

This is done by Utils#resolveURI

{code}
  def resolveURI(path: String): URI = {
try {
  val uri = new URI(path)
  if (uri.getScheme() != null) {
return uri
  }
} catch {
  case e: URISyntaxException =>
}
new File(path).getAbsoluteFile().toURI()
  }
{code}

When the scheme is not specified, this code transforms the file path into a 
URI; note that "#" will be translated into "%23" by this `toURI` call.

After digging into the Hadoop code RawLocalFileSystem#pathToFile:

{code}
  public File pathToFile(Path path) {
checkPath(path);
if (!path.isAbsolute()) {
  path = new Path(getWorkingDirectory(), path);
}
return new File(path.toUri().getPath());
  }
{code}

Here, using `URI.getPath` to get the file path leads to different behavior if we 
do not escape "#" to "%23": the part after "#" is treated as a fragment, not as 
part of the path.
But if we instead use 

{{--jars my.jar%23renamed.jar}} 

or 

{{--jars file:///path/to/my.jar%23renamed.jar}},

it succeeds either way.














was (Author: jerryshao):
Hi [~tgraves], I tested again with Mac and Linux (centos), seems the behavior 
is different.

In Mac, 

if we use {{--jars my.jar#renamed.jar}}

this file path will be resolved to URI 
{{file:/Users/sshao/projects/apache-spark/my.jar%23renamed.jar}}

if we use {{--jars 
file:///Users/sshao/projects/apache-spark/my.jar#renamed.jar}}

this file path will be resolved to URI 
{{file:/Users/sshao/projects/apache-spark/my.jar#renamed.jar}}

This is done by Utils#resolveURI

{code}
  def resolveURI(path: String): URI = {
try {
  val uri = new URI(path)
  if (uri.getScheme() != null) {
return uri
  }
} catch {
  case e: URISyntaxException =>
}
new File(path).getAbsoluteFile().toURI()
  }
{code}

Where if scheme is not specified, this code will transform the file path into 
URI, the noted thing is that "#" will be translated into "%23" in this `toURI`.

In Centos:

both 

{{--jars my.jar#renamed.jar}} 

and 

{{--jars file:///Users/sshao/projects/apache-spark/my.jar#renamed.jar}} 

will be resolved to 
{{file:/Users/sshao/projects/apache-spark/my.jar#renamed.jar}} through 
Utils#resolveURI, obviously "#" is not escaped.

So in my test, both these two ways of using --jars are failed in Centos.

After digging into the Hadoop code RawLocalFileSystem#pathToFile:

{code}
  public File pathToFile(Path path) {
checkPath(path);
if (!path.isAbsolute()) {
  path = new Path(getWorkingDirectory(), path);
}
return new File(path.toUri().getPath());
  }
{code}

Here using `URI.getPath` to get file path will lead to different behavior if we 
do not escape "#" to "%23", which will treat the part after "#" as fragment, 
not path. So in Mac without specifying scheme is succeeded, whereas in Centos 
both two ways are failed.

But if we instead using 

{{--jars my.jar%23renamed.jar}} 

or 

{{--jars file:///path/to/my.jar%23renamed.jar}},

it can be succeeded in Centos.













> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
>

[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936176#comment-14936176
 ] 

Saisai Shao commented on SPARK-10858:
-

The error I got in the failed case is the same as the one you mentioned above; 
I'm running on Mac OS with Hadoop 2.6.0.

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-10858:

Comment: was deleted

(was: The error I got in the failed case is the same as you mentioned above, 
I'm running on Mac OS with Hadoop 2.6.0.)

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936174#comment-14936174
 ] 

Saisai Shao commented on SPARK-10858:
-

The error I got in the failed case is the same as the one you mentioned above; 
I'm running on Mac OS with Hadoop 2.6.0.

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given

2015-09-29 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936175#comment-14936175
 ] 

Saisai Shao commented on SPARK-10858:
-

The error I got in the failed case is the same as the one you mentioned above; 
I'm running on Mac OS with Hadoop 2.6.0.

> YARN: archives/jar/files rename with # doesn't work unless scheme given
> ---
>
> Key: SPARK-10858
> URL: https://issues.apache.org/jira/browse/SPARK-10858
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Priority: Minor
>
> The YARN distributed cache feature with --jars, --archives, --files where you 
> can rename the file/archive using a # symbol only works if you explicitly 
> include the scheme in the path:
> works:
> --jars file:///home/foo/my.jar#renamed.jar
> doesn't work:
> --jars /home/foo/my.jar#renamed.jar
> Exception in thread "main" java.io.FileNotFoundException: File 
> file:/home/foo/my.jar#renamed.jar does not exist
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
> at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
> at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
> at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240)
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393)
> at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10879) spark on yarn support priority option

2015-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936290#comment-14936290
 ] 

Apache Spark commented on SPARK-10879:
--

User 'xiaowen147' has created a pull request for this issue:
https://github.com/apache/spark/pull/8943

> spark on yarn support priority option
> -
>
> Key: SPARK-10879
> URL: https://issues.apache.org/jira/browse/SPARK-10879
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Zhao
>
> Add a YARN-only option to spark-submit: *--priority PRIORITY*, the priority 
> of your YARN application (default: 0).
> Also add a property: *spark.yarn.priority*.
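
For reference, a rough sketch of how the proposed property could be applied to 
the YARN submission context (illustrative only; the actual change is in the 
pull request linked above):

{code}
import org.apache.hadoop.yarn.api.records.{ApplicationSubmissionContext, Priority}
import org.apache.spark.SparkConf

// Sketch only: read the proposed spark.yarn.priority property (default 0) and
// set it on the application submission context built by the YARN client.
def applyPriority(conf: SparkConf, appContext: ApplicationSubmissionContext): Unit = {
  val priority = conf.getInt("spark.yarn.priority", 0)
  appContext.setPriority(Priority.newInstance(priority))
}
{code}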



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10473) EventLog will lose messages in the long-running security application

2015-09-29 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-10473:
-
Affects Version/s: 1.6.0

> EventLog will lose messages in the long-running security application
> ---
>
> Key: SPARK-10473
> URL: https://issues.apache.org/jira/browse/SPARK-10473
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.5.0, 1.6.0
>Reporter: SaintBacchus
>
> In the implementation of *EventLoggingListener*, there is only one 
> OutputStream writing event messages to HDFS.
> When the token of the *DFSClient* behind that OutputStream expires, the 
> *DFSClient* loses the right to write, and all subsequent messages are lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10877) Assertions fail straightforward DataFrame job due to word alignment

2015-09-29 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936259#comment-14936259
 ] 

Matt Cheah commented on SPARK-10877:


Also, the error doesn't occur if I turn code generation off while keeping 
Tungsten on.

> Assertions fail straightforward DataFrame job due to word alignment
> ---
>
> Key: SPARK-10877
> URL: https://issues.apache.org/jira/browse/SPARK-10877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
> Attachments: SparkFilterByKeyTest.scala
>
>
> I have some code running in a unit test suite that fails with an assertion error.
> I have translated the failing JUnit test into a Scala script, which I will 
> attach to the ticket. The assertion error is the following:
> {code}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: 
> Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.AssertionError: 
> lengthInBytes must be a multiple of 8 (word-aligned)
> at 
> org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeWords(Murmur3_x86_32.java:53)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.hashCode(UnsafeArrayData.java:289)
> at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.hashCode(rows.scala:149)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.hashCode(rows.scala:247)
> at org.apache.spark.HashPartitioner.getPartition(Partitioner.scala:85)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at 
> org.apache.spark.sql.execution.Exchange$$anonfun$doExecute$1$$anonfun$4$$anonfun$apply$4.apply(Exchange.scala:180)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> {code}
> However, it turns out that this code actually works normally and computes the 
> correct result if assertions are turned off.
> I traced the code and found that when hashUnsafeWords was called, it was 
> given a byte-length of 12, which clearly is not a multiple of 8. However, the 
> job seems to compute correctly regardless of this fact. Of course, I can’t 
> just disable assertions for my unit test though.
> A few things we need to understand:
> 1. Why is the lengthInBytes of size 12?
> 2. Is it actually a problem that the byte length is not word-aligned? If so, 
> how should we fix the byte length? If it's not a problem, why is the 
> assertion flagging a false positive?
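
As a side note on question 2, word alignment here only means rounding the byte 
length up to the next multiple of 8. A minimal sketch of that arithmetic 
(illustrative only, not a claim about where the actual fix belongs):

{code}
// Round a byte length up to the next multiple of 8 (one word).
def roundUpToWord(numBytes: Int): Int = (numBytes + 7) & ~7

assert(roundUpToWord(12) == 16)  // the 12-byte case seen in the trace
assert(roundUpToWord(16) == 16)  // already-aligned lengths are unchanged
{code}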



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10879) spark on yarn support priority option

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10879:


Assignee: (was: Apache Spark)

> spark on yarn support priority option
> -
>
> Key: SPARK-10879
> URL: https://issues.apache.org/jira/browse/SPARK-10879
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Zhao
>
> Add a YARN-only option to spark-submit: *--priority PRIORITY*, the priority 
> of your YARN application (default: 0).
> Also add a property: *spark.yarn.priority*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10879) spark on yarn support priority option

2015-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10879:


Assignee: Apache Spark

> spark on yarn support priority option
> -
>
> Key: SPARK-10879
> URL: https://issues.apache.org/jira/browse/SPARK-10879
> Project: Spark
>  Issue Type: Improvement
>Reporter: Yun Zhao
>Assignee: Apache Spark
>
> Add a YARN-only option to spark-submit: *--priority PRIORITY*, the priority 
> of your YARN application (default: 0).
> Also add a property: *spark.yarn.priority*.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10857) SQL injection bug in JdbcDialect.getTableExistsQuery()

2015-09-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935518#comment-14935518
 ] 

Josh Rosen commented on SPARK-10857:


Spark 1.5.0+ requires Java 7+, so it should be fine to use Java 7 features.

> SQL injection bug in JdbcDialect.getTableExistsQuery()
> --
>
> Key: SPARK-10857
> URL: https://issues.apache.org/jira/browse/SPARK-10857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Rick Hillegas
>Priority: Minor
>
> All of the implementations of this method involve constructing a query by 
> concatenating boilerplate text with a user-supplied name. This looks like a 
> SQL injection bug to me.
> A better solution would be to call java.sql.DatabaseMetaData.getTables() to 
> implement this method, using the catalog and schema which are available from 
> Connection.getCatalog() and Connection.getSchema(). This would not work on 
> Java 6 because Connection.getSchema() was introduced in Java 7. However, the 
> solution would work for more modern JVMs. Limiting the vulnerability to 
> obsolete JVMs would at least be an improvement over the current situation. 
> Java 6 has been end-of-lifed and is not an appropriate platform for users who 
> are concerned about security.
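
For illustration, a minimal sketch of the metadata-based check suggested above 
(the helper name is an assumption for this example, not Spark's JdbcDialect 
API; it requires Java 7+ for Connection.getSchema()):

{code}
import java.sql.Connection

// Sketch only: ask the driver's metadata whether the table exists instead of
// concatenating the user-supplied name into a SQL string.
def tableExists(conn: Connection, table: String): Boolean = {
  val rs = conn.getMetaData.getTables(conn.getCatalog, conn.getSchema, table, Array("TABLE"))
  try rs.next() finally rs.close()
}
{code}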



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-09-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935548#comment-14935548
 ] 

Sean Owen commented on SPARK-10875:
---

Makes sense to me, do you want to try a PR?

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-09-29 Thread Nick Pritchard (JIRA)
Nick Pritchard created SPARK-10875:
--

 Summary: RowMatrix.computeCovariance() result is not exactly 
symmetric
 Key: SPARK-10875
 URL: https://issues.apache.org/jira/browse/SPARK-10875
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Nick Pritchard


For some matrices, I have seen that the computed covariance matrix is not 
exactly symmetric, most likely due to some numerical rounding errors. This is 
problematic when trying to construct an instance of {{MultivariateGaussian}}, 
because it requires an exactly symmetric covariance matrix. See reproducible 
example below.

I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
i)}} are set at the same time, with the same value.

{code}
val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
val matrix = new RowMatrix(rdd)
val mean = matrix.computeColumnSummaryStatistics().mean
val cov = matrix.computeCovariance()
val dist = new MultivariateGaussian(mean, cov) //throws 
breeze.linalg.MatrixNotSymmetricException
{code}
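
A minimal sketch of the suggested fix, under the assumption that the covariance 
matrix is filled entry by entry (names are illustrative; this is not the actual 
RowMatrix implementation): compute each value once and write it to both (i, j) 
and (j, i), so the result is symmetric by construction.

{code}
import breeze.linalg.DenseMatrix

// Sketch only: fill the upper triangle and mirror each entry exactly.
def symmetricFill(n: Int)(entry: (Int, Int) => Double): DenseMatrix[Double] = {
  val g = DenseMatrix.zeros[Double](n, n)
  for (i <- 0 until n; j <- i until n) {
    val v = entry(i, j)
    g(i, j) = v
    g(j, i) = v
  }
  g
}
{code}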



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10779) Set initialModel for KMeans model in PySpark (spark.mllib)

2015-09-29 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935572#comment-14935572
 ] 

Evan Chen commented on SPARK-10779:
---

I can work on this JIRA.

> Set initialModel for KMeans model in PySpark (spark.mllib)
> --
>
> Key: SPARK-10779
> URL: https://issues.apache.org/jira/browse/SPARK-10779
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>
> Provide initialModel param for pyspark.mllib.clustering.KMeans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10873) can't sort columns on history page

2015-09-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14935571#comment-14935571
 ] 

Marcelo Vanzin commented on SPARK-10873:


The current library does not understand rowspans, so to fix this properly we'd 
have to change the library used for sorting. It's not terribly hard, though.

> can't sort columns on history page
> --
>
> Key: SPARK-10873
> URL: https://issues.apache.org/jira/browse/SPARK-10873
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> Starting with 1.5.1 the history server page isn't allowing sorting by column



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10875) RowMatrix.computeCovariance() result is not exactly symmetric

2015-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10875:
--
Priority: Minor  (was: Major)

> RowMatrix.computeCovariance() result is not exactly symmetric
> -
>
> Key: SPARK-10875
> URL: https://issues.apache.org/jira/browse/SPARK-10875
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Nick Pritchard
>Priority: Minor
>
> For some matrices, I have seen that the computed covariance matrix is not 
> exactly symmetric, most likely due to some numerical rounding errors. This is 
> problematic when trying to construct an instance of {{MultivariateGaussian}}, 
> because it requires an exactly symmetric covariance matrix. See reproducible 
> example below.
> I would suggest modifying the implementation so that {{G(i, j)}} and {{G(j, 
> i)}} are set at the same time, with the same value.
> {code}
> val rdd = RandomRDDs.normalVectorRDD(sc, 100, 10, 0, 0)
> val matrix = new RowMatrix(rdd)
> val mean = matrix.computeColumnSummaryStatistics().mean
> val cov = matrix.computeCovariance()
> val dist = new MultivariateGaussian(mean, cov) //throws 
> breeze.linalg.MatrixNotSymmetricException
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


