[jira] [Resolved] (SPARK-3544) SparkSQL thriftServer cannot release locks correctly in Zookeeper

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3544.
--
Resolution: Incomplete

I think this is at least obsolete now; there has been no follow-up or other reports of 
this, but no particular resolution either.

> SparkSQL thriftServer cannot release locks correctly in Zookeeper
> -
>
> Key: SPARK-3544
> URL: https://issues.apache.org/jira/browse/SPARK-3544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Patrick Liu
>Priority: Critical
>
> Bug description:
> The thriftServer cannot release locks correctly in Zookeeper.
> Once a thriftServer is started, the 1st SQL submitted by Beeline or JDBC 
> requiring locks can be successfully computed.
> However, the 2nd SQL requiring locks will be blocked. And the thriftServer 
> log shows: INFO Driver: .
> 2 Tests to illustrate the problem:
> Test 1:
> (0) Start thriftServer & use beeline to connect to it.
> (1) Switch database (require locks); (OK)
> (2) Drop table (require locks): "drop table src"; (Blocked)
> Test 2:
> (0) Start thriftServer & use beeline to connect to it.
> (1) Drop table (require locks): "drop table src"; (OK)
> (2) Drop another table (require locks): "drop table src2"; (Blocked)
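
A minimal JDBC sketch of Test 2 above, for reference. The driver class and URL scheme are the standard HiveServer2/thrift JDBC ones; the host, port, user, and table names are illustrative assumptions, not taken from the report.
{code}
import java.sql.DriverManager

// Hedged reproduction sketch: two consecutive lock-requiring statements over
// the same thrift server connection.
object ThriftLockRepro {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    stmt.execute("drop table src")   // reported OK: first lock-requiring statement
    stmt.execute("drop table src2")  // reported Blocked: the ZooKeeper lock was never released
    conn.close()
  }
}
{code}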
> Basic Information:
> Spark 1.1.0
> Hadoop 2.0.0-cdh4.6.0
> Compile Command:
> export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M 
> -XX:ReservedCodeCacheSize=512m" 
> mvn -Dhadoop.version=2.0.0-cdh4.6.0 -Phive -Pspark-ganglia-lgpl -DskipTests 
> package 
> hive-site.xml:
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://ns</value>
> </property>
> <property>
>   <name>dfs.nameservices</name>
>   <value>ns</value>
> </property>
> <property>
>   <name>dfs.ha.namenodes.ns</name>
>   <value>machine01,machine02</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.ns.machine01</name>
>   <value>machine01:54310</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.ns.machine02</name>
>   <value>machine02:54310</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.ns</name>
>   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:mysql://localhost:3306/metastore</value>
>   <description>JDBC connect string for a JDBC metastore</description>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>com.mysql.jdbc.Driver</value>
>   <description>Driver class name for a JDBC metastore</description>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionUserName</name>
>   <value>hive_user</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionPassword</name>
>   <value>hive_123</value>
> </property>
> <property>
>   <name>datanucleus.autoCreateSchema</name>
>   <value>false</value>
> </property>
> <property>
>   <name>datanucleus.autoCreateTables</name>
>   <value>true</value>
> </property>
> <property>
>   <name>datanucleus.fixedDatastore</name>
>   <value>false</value>
> </property>
> <property>
>   <name>hive.support.concurrency</name>
>   <description>Enable Hive's Table Lock Manager Service</description>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.zookeeper.quorum</name>
>   <value>machine01,machine02,machine03</value>
>   <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
> </property>
> <property>
>   <name>hive.metastore.warehouse.dir</name>
>   <value>/user/hive/warehouse</value>
>   <description>Hive warehouse directory</description>
> </property>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>machine01:8032</value>
> </property>
> <property>
>   <name>io.compression.codecs</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> <property>
>   <name>mapreduce.output.fileoutputformat.compress.codec</name>
>   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
> </property>
> <property>
>   <name>mapreduce.output.fileoutputformat.compress.type</name>
>   <value>BLOCK</value>
> </property>
> <property>
>   <name>hive.exec.show.job.failure.debug.info</name>
>   <value>true</value>
>   <description>If a job fails, whether to provide a link in the CLI to the task with the 
> most failures, along with debugging hints if applicable.</description>
> </property>
> <property>
>   <name>hive.hwi.listen.host</name>
>   <value>0.0.0.0</value>
>   <description>This is the host address the Hive Web Interface will listen on</description>
> </property>
> <property>
>   <name>hive.hwi.listen.port</name>
>   <value></value>
>   <description>This is the port the Hive Web Interface will listen on</description>
> </property>
> <property>
>   <name>hive.hwi.war.file</name>
>   <value>/lib/hive-hwi-0.10.0-cdh4.2.0.war</value>
>   <description>This is the WAR file with the jsp content for Hive Web Interface</description>
> </property>
> <property>
>   <name>hive.aux.jars.path</name>
>   <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.10.0-cdh4.6.0.jar,file:///usr/lib/hbase/hbase-0.94.15-cdh4.6.0-security.jar,file:///usr/lib/zookeeper/zookeeper.jar</value>
> </property>
> <property>
>   <name>hbase.zookeeper.quorum</name>
>   <value>machine01,machine02,machine03</value>
> </property>
> <property>
>   <name>hive.cli.print.header</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
>   <description>In unsecure mode, setting this property to true will cause 
> the metastore to execute DFS operations using the client's reported user and 
> group permissions. Note that this property must be set on both the client and 
> server sides. Further note that its best effort. If client sets its to true 
> and server sets it to false, client setting will be ignored.</description>
> </property>
> <property>
>   <name>hive.security.authorization.enabled</name>

[jira] [Resolved] (SPARK-5772) spark-submit --driver-memory not correctly taken into account

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5772.
--
Resolution: Won't Fix

> spark-submit --driver-memory not correctly taken into account
> -
>
> Key: SPARK-5772
> URL: https://issues.apache.org/jira/browse/SPARK-5772
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.2.0, 1.2.1
> Environment: Debian 7
>Reporter: Guillaume Charhon
>Priority: Minor
>
> spark-submit --driver-memory not taken correctly into account
> The spark.driver.memory does not seem to be correctly taken into account. I 
> came across this issue as I had a java.lang.OutOfMemoryError: Java heap space 
> when I was doing a random forest training.
> I did all my tests with 1 master and 4 worker nodes. All machines have 16 
> cores and 106 GB of RAM, running Debian 7 on Google Compute Engine.
> As I had the memory error, I noticed that the master had only 265.4 MB 
> registered on the Executors page of the Web UI. Worker machines have 42.4 GB.
> in command line:
> ../hadoop/spark-install/bin/spark-submit --driver-memory=83971m predict.py 
> --> does NOT work (master memory is not correct)
> in spark-defaults.conf : 
> spark.driver.memory 83971m 
> -->works
> in spark-env.sh:
> SPARK_DRIVER_MEMORY=83971m 
> -->works
> The spark.driver.memory is displayed with the correct value (83971m) on the 
> Web UI (http://spark-m:4040/environment/) for all the tests. 
> However, we can see on the Executors Web UI page 
> (http://spark-m:4040/executors/) that the memory is not correctly allocated 
> when the option is provided with the spark-submit command line.
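
A quick way to cross-check the symptom independently of the Web UI is to print the driver JVM's actual max heap from the application. A minimal sketch; the 83971m figure it should approach comes from the report above.
{code}
// Sanity-check sketch: if --driver-memory took effect, the driver's max heap
// should be close to the requested 83971m rather than the ~265 MB shown in the UI.
val maxHeapMb = Runtime.getRuntime.maxMemory / (1024L * 1024L)
println(s"Driver max heap: $maxHeapMb MB")
{code}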






[jira] [Resolved] (SPARK-5998) Make Spark Streaming checkpoint version compatible

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5998.
--
Resolution: Won't Fix

Looks like this didn't go anywhere (and could be quite complex); it could be 
reopened if needed.

> Make Spark Streaming checkpoint version compatible
> --
>
> Key: SPARK-5998
> URL: https://issues.apache.org/jira/browse/SPARK-5998
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Saisai Shao
>
> Currently Spark Streaming's checkpointing (serialization) mechanism is not 
> version compatible: if the code is updated or anything changes, the old 
> checkpoint files cannot be read again. To keep streaming apps long-running and 
> upgradable, making the checkpointing mechanism version compatible is 
> meaningful.
> I'm creating this JIRA as a stub for the issue; I'm currently working on it 
> and a design doc will be posted later. 






[jira] [Updated] (SPARK-3750) Log ulimit settings at warning if they are too low

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3750:
-
Priority: Minor  (was: Major)

Andrew do you still want to add this?

> Log ulimit settings at warning if they are too low
> --
>
> Key: SPARK-3750
> URL: https://issues.apache.org/jira/browse/SPARK-3750
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> In recent versions of Spark the shuffle implementation is much more 
> aggressive about writing many files out to disk at once. Most Linux kernels 
> have a default limit on the number of open files per process, and Spark can 
> exhaust this limit. The current hash-based shuffle implementation requires 
> as many files as the product of the map and reduce partition counts in a wide 
> dependency.
> In order to reduce the errors we're seeing on the user list, we should 
> determine a value that is considered "too low" for normal operations and log 
> a warning on executor startup when that value isn't met.
> 1. determine what ulimit is acceptable
> 2. log when that value isn't met
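
A rough sketch of the idea, not the proposed patch: read the soft open-file limit via the shell and warn when it falls below a threshold. The 10000 cutoff is an arbitrary placeholder, and reading the limit with `ulimit -n` through bash is an assumption about how the check would be done.
{code}
import scala.sys.process._

// Hedged sketch: query the soft open-file limit at startup and warn if it is
// below an assumed threshold.
val minimumOpenFiles = 10000L
val raw = Seq("bash", "-c", "ulimit -n").!!.trim
val openFileLimit = if (raw == "unlimited") Long.MaxValue else raw.toLong
if (openFileLimit < minimumOpenFiles) {
  Console.err.println(
    s"WARN: open file limit ($openFileLimit) is below $minimumOpenFiles; " +
      "shuffle-heavy jobs may fail with 'Too many open files'.")
}
{code}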






[jira] [Resolved] (SPARK-2559) Add A Link to Download the Application Events Log for Offline Analysis

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2559.
--
Resolution: Not A Problem

I'm not sure when this was added, or else I'd mark it as Fixed, but you can 
download logs from /application/[id]/logs:

http://spark.apache.org/docs/latest/monitoring.html

> Add A Link to Download the Application Events Log for Offline Analysis
> --
>
> Key: SPARK-2559
> URL: https://issues.apache.org/jira/browse/SPARK-2559
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Pat McDonough
>
> To analyze application issues offline (e.g. on another machine while 
> supporting an end user), provide end users a link to download an archive of 
> the application event logs. The archive can then be opened via an offline 
> History Server.






[jira] [Resolved] (SPARK-10912) Improve Spark metrics executor.filesystem

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10912.
---
Resolution: Won't Fix

This is EC2-specific, and support for that has moved out of Spark. I think this 
is specifically a function of the Hadoop InputFormat support for S3, rather 
than Spark.

> Improve Spark metrics executor.filesystem
> -
>
> Key: SPARK-10912
> URL: https://issues.apache.org/jira/browse/SPARK-10912
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.0
>Reporter: Yongjia Wang
>Priority: Minor
> Attachments: s3a_metrics.patch
>
>
> org.apache.spark.executor.ExecutorSource has 2 filesystem metrics: 
> "hdfs" and "file". I started using s3 as the persistent storage with a Spark 
> standalone cluster in EC2, and s3 read/write metrics do not appear anywhere. 
> The 'file' metric appears to cover only the driver reading local files; it 
> would be nice to also report shuffle read/write metrics, so they can help with 
> optimization.
> I think these 2 things (s3 and shuffle) are very useful and cover all the 
> missing information about Spark IO, especially for an s3 setup.
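
For context, ExecutorSource derives its "hdfs" and "file" gauges from Hadoop's FileSystem statistics. A hedged sketch of the same lookup for an "s3a" scheme; the scheme name and the fields printed are assumptions about what such a metric would expose, not an existing Spark metric.
{code}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Read the accumulated byte counters for the "s3a" scheme the same way the
// existing executor gauges read them for "hdfs" and "file".
val s3aStats = FileSystem.getAllStatistics.asScala.find(_.getScheme == "s3a")
val bytesRead = s3aStats.map(_.getBytesRead).getOrElse(0L)
val bytesWritten = s3aStats.map(_.getBytesWritten).getOrElse(0L)
println(s"s3a bytesRead=$bytesRead bytesWritten=$bytesWritten")
{code}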






[jira] [Resolved] (SPARK-13143) EC2 cluster silently not destroyed for non-default regions

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13143.
---
Resolution: Won't Fix

I think we're not managing EC2 issues in Spark anymore, as the support is no 
longer in Spark.

> EC2 cluster silently not destroyed for non-default regions
> --
>
> Key: SPARK-13143
> URL: https://issues.apache.org/jira/browse/SPARK-13143
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.0
>Reporter: Theodore Vasiloudis
>Priority: Minor
>  Labels: trivial
>
> If you start a cluster in a non-default region using the EC2 scripts and then 
> try to destroy it, you get the message:
> {quote}
> Terminating master...
> Terminating slaves...
> {quote}
> after which the script terminates with no further info.
> This leaves the instances still running without ever informing the user.
> The reason this happens is that the destroy action in {{spark_ec2.py}} calls 
> {{get_existing_cluster}} with the {{die_on_error}} argument set to {{False}} 
> for some reason.
> I'll submit a PR for this.






[jira] [Updated] (SPARK-13913) DataFrame.withColumn fails when trying to replace existing column with dot in name

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13913:
--
Component/s: SQL

> DataFrame.withColumn fails when trying to replace existing column with dot in 
> name
> --
>
> Key: SPARK-13913
> URL: https://issues.apache.org/jira/browse/SPARK-13913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Emmanuel Leroy
>
> http://stackoverflow.com/questions/36000147/spark-1-6-apply-function-to-column-with-dot-in-name-how-to-properly-escape-coln/36005334#36005334
> if I do (column name exists already and has dot in it, but is not a nested 
> column):
> scala> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
> scala> df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
> org.apache.spark.sql.AnalysisException: cannot resolve 'raw.minOfDay' given 
> input columns raw.hourOfDay_2, raw.dayOfWeek, raw.sensor2, raw.hourOfDay, 
> raw.minOfDay;
> at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> but if I do:
> scala> df = df.withColumn("raw.hourOfDay_2", df.col("`raw.hourOfDay`"))
> scala> df.printSchema
> root
>  |-- raw.hourOfDay: long (nullable = true)
>  |-- raw.minOfDay: long (nullable = true)
>  |-- raw.dayOfWeek: long (nullable = true)
>  |-- raw.sensor2: long (nullable = true)
>  |-- raw.hourOfDay_2: long (nullable = true)
> it works fine (i.e. the new column is created with a dot in its name).
> The only difference is that the name "raw.hourOfDay_2" does not exist yet, 
> and is properly created as a column name with a dot, not as a nested column.
> The documentation however says that if the column exists it will replace it, 
> but it seems there is a misinterpretation of the column name as a nested 
> column:
> def withColumn(colName: String, col: Column): DataFrame
> Returns a new DataFrame by adding a column or replacing the existing column 
> that has the same name.
> Replacing a column without a dot in it works fine.
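
A possible workaround sketch while the replacement path misbehaves (not the fix for this issue): rebuild the projection with select, quoting the dotted name with backticks so it is not treated as a nested field. The `+ 1` transformation is purely illustrative.
{code}
// Hedged workaround sketch: replace the dotted column via select + alias
// instead of withColumn. Column names match the schema in the report.
val newCols = df.columns.filterNot(_ == "raw.hourOfDay").map(c => df.col(s"`$c`")) :+
  (df.col("`raw.hourOfDay`") + 1).as("raw.hourOfDay")
val replaced = df.select(newCols: _*)
replaced.printSchema()
{code}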






[jira] [Resolved] (SPARK-13889) Integer overflow when calculating the max number of executor failure

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13889.
---
   Resolution: Fixed
 Assignee: Carson Wang
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11713

> Integer overflow when calculating the max number of executor failure
> 
>
> Key: SPARK-13889
> URL: https://issues.apache.org/jira/browse/SPARK-13889
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Carson Wang
>Assignee: Carson Wang
> Fix For: 2.0.0
>
>
> The max number of executor failure before failing the application is default 
> to twice the maximum number of executors if dynamic allocation is enabled. 
> The default value for "spark.dynamicAllocation.maxExecutors" is Int.MaxValue. 
> So this causes an integer overflow and a wrong result.
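
The arithmetic at fault is easy to see in isolation. A minimal sketch; the clamping shown at the end is one obvious remedy, not necessarily the merged fix.
{code}
// Doubling Int.MaxValue in Int arithmetic wraps around to a negative number,
// so the derived "max executor failures" becomes meaningless.
val maxExecutors = Int.MaxValue              // default of spark.dynamicAllocation.maxExecutors
val maxNumExecutorFailures = 2 * maxExecutors
println(maxNumExecutorFailures)              // -2, not 4294967294

// One remedy: compute in Long space and clamp back into the Int range.
val clamped = math.min(2L * maxExecutors, Int.MaxValue.toLong).toInt
println(clamped)                             // 2147483647
{code}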






[jira] [Commented] (SPARK-13915) Allow bin/spark-submit to be called via symbolic link

2016-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197166#comment-15197166
 ] 

Sean Owen commented on SPARK-13915:
---

(It's not a CDH difference; the script is identical: 
https://github.com/cloudera/spark/blob/cdh5-1.5.0_5.5.2/bin/spark-submit )

I think this is due to a) a custom modification to spark-submit, b) a custom 
environment change, or c) a custom build of Spark 1.5.1. Symlinks themselves are 
not an issue. I'd have to close this without clarity on what symlink-related 
problem in Spark trunk is solved by this change.

> Allow bin/spark-submit to be called via symbolic link
> -
>
> Key: SPARK-13915
> URL: https://issues.apache.org/jira/browse/SPARK-13915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
> Environment: CentOS 6.6
> Tarball spark distribution and CDH-5.x.x Spark version (both)
>Reporter: Rafael Pecin Ferreira
>Priority: Minor
>
> We have a CDH-5 cluster that comes with spark-1.5.0 and we needed to use 
> spark-1.5.1 for bug fix issues.
> When I set up the spark (out of the CDH box) to the system alternatives, it 
> created a sequence of symbolic links to the target spark installation.
> When I tried to run spark-submit, the bash process calls the target with "$0" 
> set to /usr/bin/spark-submit, but this script uses the "$0" variable to locate 
> its dependencies, and I was facing these messages:
> [hdfs@server01 ~]$ env spark-submit
> ls: cannot access /usr/assembly/target/scala-2.10: No such file or directory
> Failed to find Spark assembly in /usr/assembly/target/scala-2.10.
> You need to build Spark before running this program.
> I fixed the spark-submit script by adding these lines:
> if [ -h "$0" ] ; then
>     checklink="$0";
>     while [ -h "$checklink" ] ; do
>         checklink=`readlink $checklink`
>     done
>     SPARK_HOME="$(cd "`dirname "$checklink"`"/..; pwd)";
> else
>     SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)";
> fi
> It would be very nice if this piece of code could be put into the spark-submit 
> script to allow us to have multiple spark alternatives on the system.






[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-16 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197165#comment-15197165
 ] 

Steve Loughran commented on SPARK-7481:
---

...thinking some more about this

How about 

# adding a {{spark-cloud}} module which, initially, does nothing but declare 
the dependencies on {{hadoop-aws}}, {{hadoop-openstack}}, and on 2.7+, 
{{hadoop-azure}}. 
# have spark assembly declare a dependency on this module, but explicitly 
excluding all dependencies other than the hadoop ones (i.e. no amazon libs, no 
extra httpclient ones for openstack (if there are any), anything azure wants). 
If someone wants to add the relevant amazon libs, they need to explicitly add 
it on the {{--jars}} option.

Doing it this way means that if a project depends on {{spark-cloud}} it gets 
all the cloud dependencies that version of spark+hadoop needs.

It also provides a placeholder for explicit cloud support, specifically

- output committers that don't try to rename/assume that directory delete is 
atomic and O(1)
- some optional tests/examples to read/write data. 

The tests would be good not just for spark, but for catching regressions in 
hadoop/aws/azure code.

If people think this is good, assign it to me and I'll look at it in April.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right and to add s3a, swift & azure, the dependencies 
> of Spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure).
> This adds more stuff to the client bundle, but will mean a single Spark 
> package can talk to all of the stores.






[jira] [Commented] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1

2016-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197163#comment-15197163
 ] 

Sean Owen commented on SPARK-13933:
---

The Curator update itself seems fine, but the Guava thing is unfortunate. So 
we'd have to depend on a custom Guava build? Is there an actual problem with not 
updating the Curator dep? If they're API-compatible it shouldn't really matter.

> hadoop-2.7 profile's curator version should be 2.7.1
> 
>
> Key: SPARK-13933
> URL: https://issues.apache.org/jira/browse/SPARK-13933
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> This is pretty minor, more due diligence than any binary compatibility concern.
> # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0
> # the actual hadoop-2.7.1 dependency is on curator 2.7.1; this came from 
> HADOOP-11492
> For consistency, the profile can/should be changed. However, note that as well 
> as the incompatibilities described in HADOOP-11492, the version of Guava 
> that Curator asserts a need for is 15.x. HADOOP-11612 showed what needed to 
> be done to address compatibility problems there; one of the Curator classes 
> had to be forked to make it compatible with Guava 11+.






[jira] [Created] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1

2016-03-16 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-13933:
--

 Summary: hadoop-2.7 profile's curator version should be 2.7.1
 Key: SPARK-13933
 URL: https://issues.apache.org/jira/browse/SPARK-13933
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.0.0
Reporter: Steve Loughran
Priority: Minor


This is pretty minor, more due diligence than any binary compatibility concern.

# the {{hadoop-2.7}} profile declares the curator version to be 2.6.0
# the actual hadoop-2.7.1 dependency is on curator 2.7.1; this came from 
HADOOP-11492

For consistency, the profile can/should be changed. However, note that as well 
as the incompatibilities described in HADOOP-11492, the version of Guava that 
Curator asserts a need for is 15.x. HADOOP-11612 showed what needed to be done 
to address compatibility problems there; one of the Curator classes had to be 
forked to make it compatible with Guava 11+.








[jira] [Commented] (SPARK-13843) Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages

2016-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197147#comment-15197147
 ] 

Sean Owen commented on SPARK-13843:
---

Yeah same question about Ganglia support

> Move streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, 
> streaming-twitter to Spark packages
> ---
>
> Key: SPARK-13843
> URL: https://issues.apache.org/jira/browse/SPARK-13843
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Currently there are a few sub-projects, each for integrating with different 
> external sources for Streaming.  Now that we have better ability to include 
> external libraries (Spark packages) and with Spark 2.0 coming up, we can move 
> the following projects out of Spark to https://github.com/spark-packages
> - streaming-flume
> - streaming-akka
> - streaming-mqtt
> - streaming-zeromq
> - streaming-twitter
> They are just ancillary packages and, considering the overhead of 
> maintenance, running tests, and PR failures, it's better to maintain them 
> outside of Spark. In addition, these projects can have their own release 
> cycles and we can release them faster.






[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197142#comment-15197142
 ] 

Sean Owen commented on SPARK-13928:
---

Are you suggesting exposing it, or hiding it further? I wasn't clear. I think 
this is properly private to Spark, as user apps can directly access any logging 
package they like.

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide a compatibility package that adds 
> logging.
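
A hedged sketch of the user-side shim the description alludes to, assuming an SLF4J backend. The method names mirror the common Spark Logging helpers, but this is not Spark's actual trait.
{code}
// Illustrative compatibility trait a user application could define so that
// code mixing in org.apache.spark.Logging keeps compiling after the move.
package org.apache.spark

import org.slf4j.{Logger, LoggerFactory}

trait Logging {
  @transient private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
}
{code}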






[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-16 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-13932:
-
Description: 
A complex aggregate query using a condition in the aggregate function and a 
GROUP BY ... HAVING clause raises an exception. This issue only happens in 
Spark 1.6.x, not in Spark 1.5.x.

Here is a typical error message {code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to re-produce the error in a spark-shell session:
{code}
import sqlContext.implicits._

case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: Int = (math.random*1e3).toInt,
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )

val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, 
GROUPING__ID"
val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, 
SUM(m) AS k3, GROUPING__ID"

val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"

sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK

sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

And here is the full log
{code}
scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:316)
at 

[jira] [Updated] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-16 Thread Tien-Dung LE (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien-Dung LE updated SPARK-13932:
-
Description: 
A complex aggregate query using a condition in the aggregate function and a 
GROUP BY ... HAVING clause raises an exception. Here is a typical error message {code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to re-produce the error in a spark-shell session:
{code}
import sqlContext.implicits._

case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: Int = (math.random*1e3).toInt,
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )

val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, 
GROUPING__ID"
val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, 
SUM(m) AS k3, GROUPING__ID"

val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"

sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK

sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}

Here is the full log
{code}
scala> sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" )
res12: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
res13: org.apache.spark.sql.DataFrame = [a: string, b: int, k1: bigint, k2: 
bigint, k3: double, GROUPING__ID: int]

scala> sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4$$anonfun$26.apply(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:471)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:467)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:316)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:316)
at 

[jira] [Created] (SPARK-13932) CUBE Query with filter (HAVING) and condition (IF) raises an AnalysisException

2016-03-16 Thread Tien-Dung LE (JIRA)
Tien-Dung LE created SPARK-13932:


 Summary: CUBE Query with filter (HAVING) and condition (IF) raises 
an AnalysisException
 Key: SPARK-13932
 URL: https://issues.apache.org/jira/browse/SPARK-13932
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1, 1.6.0
Reporter: Tien-Dung LE


A complex aggregate query using a condition in the aggregate function and a 
GROUP BY ... HAVING clause raises an exception. Here is a typical error message {code}
org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: 
b#55, b#124.; line 1 pos 178
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
{code}

Here is a code snippet to re-produce the error in a spark-shell session:
{code}
import sqlContext.implicits._

case class Toto(  a: String = f"${(math.random*1e6).toLong}%06.0f",
  b: Int = (math.random*1e3).toInt,
  n: Int = (math.random*1e3).toInt,
  m: Double = (math.random*1e3))

val data = sc.parallelize(1 to 1e6.toInt).map(i => Toto())
val df: org.apache.spark.sql.DataFrame = sqlContext.createDataFrame( data )

df.registerTempTable( "toto" )

val sqlSelect1   = "SELECT a, b, COUNT(1) AS k1, COUNT(1) AS k2, SUM(m) AS k3, 
GROUPING__ID"
val sqlSelect2   = "SELECT a, b, COUNT(1) AS k1, COUNT(IF(n > 500,1,0)) AS k2, 
SUM(m) AS k3, GROUPING__ID"

val sqlGroupBy  = "FROM toto GROUP BY a, b GROUPING SETS ((a,b),(a),(b))"
val sqlHaving   = "HAVING ((GROUPING__ID & 1) == 1) AND (b > 500)"

sqlContext.sql( s"$sqlSelect1 $sqlGroupBy $sqlHaving" ) // OK
sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" ) // OK

sqlContext.sql( s"$sqlSelect2 $sqlGroupBy $sqlHaving" ) // ERROR
{code}
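
A hedged workaround sketch (not a fix): materialize the grouped result first and filter it in an outer query, so the HAVING predicate no longer has to resolve 'b' against two copies of the attribute. The temp table name is an illustrative choice.
{code}
// Workaround sketch built on the snippet above: register the grouped result
// and apply the filter as a WHERE clause on top of it.
val grouped = sqlContext.sql( s"$sqlSelect2 $sqlGroupBy" )
grouped.registerTempTable("toto_grouped")
sqlContext.sql(
  "SELECT * FROM toto_grouped WHERE ((GROUPING__ID & 1) = 1) AND (b > 500)")
{code}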







[jira] [Updated] (SPARK-13793) PipedRDD doesn't propagate exceptions while reading parent RDDd

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13793:
--
Assignee: Tejas Patil

> PipedRDD doesn't propagate exceptions while reading parent RDDd
> ---
>
> Key: SPARK-13793
> URL: https://issues.apache.org/jira/browse/SPARK-13793
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.0.0
>
>
> PipedRDD creates a process to run the command and spawns a thread to feed the 
> input data to the process's stdin. If there is any exception in the child 
> thread which gets the input data from the parent RDD, the child thread does 
> not propagate that exception to the main thread. For example, in the event of 
> fetch failures, since the exception is not propagated, the entire stage fails. 
> The correct behaviour would be to recompute the parent(s) and then relaunch 
> the stage.
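
A hedged reproduction sketch of the scenario: the failure is simulated with a thrown exception in the parent RDD rather than a real fetch failure, and "cat" is just a convenient command for pipe().
{code}
// pipe() feeds the parent partition to the child process from a separate
// writer thread; an exception while computing the parent is the condition the
// report says is not propagated back to the main task thread.
val parent = sc.parallelize(1 to 10, 2).map { i =>
  if (i == 5) throw new RuntimeException("simulated failure while reading parent RDD")
  i.toString
}
parent.pipe("cat").collect()
{code}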






[jira] [Resolved] (SPARK-13793) PipedRDD doesn't propagate exceptions while reading parent RDDd

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13793.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11628
[https://github.com/apache/spark/pull/11628]

> PipedRDD doesn't propagate exceptions while reading parent RDDd
> ---
>
> Key: SPARK-13793
> URL: https://issues.apache.org/jira/browse/SPARK-13793
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Tejas Patil
>Priority: Minor
> Fix For: 2.0.0
>
>
> PipedRDD creates a process to run the command and spawns a thread to feed the 
> input data to the process's stdin. If there is any exception in the child 
> thread which gets the input data from the parent RDD, the child thread does 
> not propagate that exception to the main thread. For example, in the event of 
> fetch failures, since the exception is not propagated, the entire stage fails. 
> The correct behaviour would be to recompute the parent(s) and then relaunch 
> the stage.






[jira] [Commented] (SPARK-12836) spark enable both driver run executor & write to HDFS

2016-03-16 Thread Lior Chaga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197120#comment-15197120
 ] 

Lior Chaga commented on SPARK-12836:


I used the --no-switch_user Mesos config, and it worked. Writing to Hadoop was 
done with HADOOP_USER_NAME, while the Spark executors were running with the 
mesos-slave user's permissions.

> spark enable both driver run executor & write to HDFS
> -
>
> Key: SPARK-12836
> URL: https://issues.apache.org/jira/browse/SPARK-12836
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos, Scheduler, Spark Core
>Affects Versions: 1.6.0
> Environment: HADOOP_USER_NAME=qhstats
> SPARK_USER=root
>Reporter: astralidea
>  Labels: features
>
> When Spark sets the env var HADOOP_USER_NAME, CoarseMesosSchedulerBackend will 
> set the Spark user from this env var. But in my cluster Spark must run as 
> root, while writing to HDFS requires setting HADOOP_USER_NAME, so we need a 
> configuration to run the executor as root and write to HDFS as another user.






[jira] [Commented] (SPARK-13931) Resolve stage hanging up problem in a particular case

2016-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197099#comment-15197099
 ] 

Apache Spark commented on SPARK-13931:
--

User 'GavinGavinNo1' has created a pull request for this issue:
https://github.com/apache/spark/pull/11760

> Resolve stage hanging up problem in a particular case
> -
>
> Key: SPARK-13931
> URL: https://issues.apache.org/jira/browse/SPARK-13931
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1
>Reporter: ZhengYaofeng
>
> Suppose the following steps:
> 1. Enable speculation in the application. 
> 2. Run this app and suppose the last task of shuffleMapStage 1 finishes. To be 
> precise: from the eyes of the DAG, this stage really finishes, and from the 
> eyes of the TaskSetManager, the variable 'isZombie' is set to true, but the 
> variable runningTasksSet isn't empty because of speculation.
> 3. Suddenly, executor 3 is lost. The TaskScheduler, receiving this signal, 
> invokes all executorLost functions of rootPool's taskSetManagers. The DAG, 
> receiving this signal, removes all of this executor's outputLocs.
> 4. The TaskSetManager adds all of this executor's tasks to pendingTasks and 
> tells the DAG they will be resubmitted (attention: possibly not on time).
> 5. The DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, 
> and finds that shuffleMapStage 1 is its missing parent because some outputLocs 
> were removed due to the lost executor. Then the DAG submits shuffleMapStage 1 
> again.
> 6. The DAG still receives Task 'Resubmitted' signals from the old 
> taskSetManager, and increases the number of pendingTasks of shuffleMapStage 1 
> each time. However, the old taskSetManager won't submit any new task because 
> its variable 'isZombie' is set to true.
> 7. Finally shuffleMapStage 1 never finishes in the DAG, together with all 
> stages depending on it.






[jira] [Assigned] (SPARK-13931) Resolve stage hanging up problem in a particular case

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13931:


Assignee: (was: Apache Spark)

> Resolve stage hanging up problem in a particular case
> -
>
> Key: SPARK-13931
> URL: https://issues.apache.org/jira/browse/SPARK-13931
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1
>Reporter: ZhengYaofeng
>
> Suppose the following steps:
> 1. Enable speculation in the application. 
> 2. Run this app and suppose the last task of shuffleMapStage 1 finishes. To be 
> precise: from the eyes of the DAG, this stage really finishes, and from the 
> eyes of the TaskSetManager, the variable 'isZombie' is set to true, but the 
> variable runningTasksSet isn't empty because of speculation.
> 3. Suddenly, executor 3 is lost. The TaskScheduler, receiving this signal, 
> invokes all executorLost functions of rootPool's taskSetManagers. The DAG, 
> receiving this signal, removes all of this executor's outputLocs.
> 4. The TaskSetManager adds all of this executor's tasks to pendingTasks and 
> tells the DAG they will be resubmitted (attention: possibly not on time).
> 5. The DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, 
> and finds that shuffleMapStage 1 is its missing parent because some outputLocs 
> were removed due to the lost executor. Then the DAG submits shuffleMapStage 1 
> again.
> 6. The DAG still receives Task 'Resubmitted' signals from the old 
> taskSetManager, and increases the number of pendingTasks of shuffleMapStage 1 
> each time. However, the old taskSetManager won't submit any new task because 
> its variable 'isZombie' is set to true.
> 7. Finally shuffleMapStage 1 never finishes in the DAG, together with all 
> stages depending on it.






[jira] [Updated] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13396:
--
Assignee: Gayathri Murali

> Stop using our internal deprecated .metrics on ExceptionFailure instead use 
> accumUpdates
> 
>
> Key: SPARK-13396
> URL: https://issues.apache.org/jira/browse/SPARK-13396
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Assignee: Gayathri Murali
>Priority: Minor
> Fix For: 2.0.0
>
>
> src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value 
> metrics in class ExceptionFailure is deprecated: use accumUpdates instead






[jira] [Assigned] (SPARK-13931) Resolve stage hanging up problem in a particular case

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13931:


Assignee: Apache Spark

> Resolve stage hanging up problem in a particular case
> -
>
> Key: SPARK-13931
> URL: https://issues.apache.org/jira/browse/SPARK-13931
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1
>Reporter: ZhengYaofeng
>Assignee: Apache Spark
>
> Suppose the following steps:
> 1. Enable speculation in the application. 
> 2. Run this app and suppose the last task of shuffleMapStage 1 finishes. To be 
> precise: from the eyes of the DAG, this stage really finishes, and from the 
> eyes of the TaskSetManager, the variable 'isZombie' is set to true, but the 
> variable runningTasksSet isn't empty because of speculation.
> 3. Suddenly, executor 3 is lost. The TaskScheduler, receiving this signal, 
> invokes all executorLost functions of rootPool's taskSetManagers. The DAG, 
> receiving this signal, removes all of this executor's outputLocs.
> 4. The TaskSetManager adds all of this executor's tasks to pendingTasks and 
> tells the DAG they will be resubmitted (attention: possibly not on time).
> 5. The DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, 
> and finds that shuffleMapStage 1 is its missing parent because some outputLocs 
> were removed due to the lost executor. Then the DAG submits shuffleMapStage 1 
> again.
> 6. The DAG still receives Task 'Resubmitted' signals from the old 
> taskSetManager, and increases the number of pendingTasks of shuffleMapStage 1 
> each time. However, the old taskSetManager won't submit any new task because 
> its variable 'isZombie' is set to true.
> 7. Finally shuffleMapStage 1 never finishes in the DAG, together with all 
> stages depending on it.






[jira] [Resolved] (SPARK-13396) Stop using our internal deprecated .metrics on ExceptionFailure instead use accumUpdates

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13396.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11544
[https://github.com/apache/spark/pull/11544]

> Stop using our internal deprecated .metrics on ExceptionFailure instead use 
> accumUpdates
> 
>
> Key: SPARK-13396
> URL: https://issues.apache.org/jira/browse/SPARK-13396
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: holdenk
>Priority: Minor
> Fix For: 2.0.0
>
>
> src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala:385: value 
> metrics in class ExceptionFailure is deprecated: use accumUpdates instead






[jira] [Resolved] (SPARK-13175) Cleanup deprecation warnings from Scala 2.11 upgrade

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13175.
---
   Resolution: Done
 Assignee: holdenk
Fix Version/s: 2.0.0

> Cleanup deprecation warnings from Scala 2.11 upgrade
> 
>
> Key: SPARK-13175
> URL: https://issues.apache.org/jira/browse/SPARK-13175
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, Spark Core, SQL, Streaming
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> We should clean up whatever deprecation warnings we can now that the build 
> defaults to 2.11 - some of them are important (e.g. the old behaviour was 
> unreliable) and others just distract from finding the important ones in our 
> build. We should try and fix both kinds. This can be thought of as the 
> follow-up to https://issues.apache.org/jira/browse/SPARK-721 where we fixed a 
> bunch of the warnings back in the day.






[jira] [Resolved] (SPARK-13397) Cleanup transient annotations which aren't being applied

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13397.
---
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11725

> Cleanup transient annotations which aren't being applied
> 
>
> Key: SPARK-13397
> URL: https://issues.apache.org/jira/browse/SPARK-13397
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, Spark Core, SQL, Streaming
>Reporter: holdenk
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In a number of places we have transient markers which are discarded as unused.






[jira] [Resolved] (SPARK-13395) Silence or skip unsafe deprecation warnings

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13395.
---
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/11725

> Silence or skip unsafe deprecation warnings
> ---
>
> Key: SPARK-13395
> URL: https://issues.apache.org/jira/browse/SPARK-13395
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: holdenk
>Assignee: Sean Owen
>Priority: Trivial
> Fix For: 2.0.0
>
>
> In a number of places inside Spark we use the unsafe API, which produces 
> warnings we probably want to silence if possible.






[jira] [Updated] (SPARK-13906) Spark driver hangs when slave is started or stopped (org.apache.spark.rpc.RpcTimeoutException).

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13906:
--
Assignee: Yonathan Randolph

> Spark driver hangs when slave is started or stopped 
> (org.apache.spark.rpc.RpcTimeoutException).
> ---
>
> Key: SPARK-13906
> URL: https://issues.apache.org/jira/browse/SPARK-13906
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Machine with one core (e.g. ec2 t2.small)
>Reporter: Yonathan Randolph
>Assignee: Yonathan Randolph
>Priority: Minor
> Fix For: 2.0.0
>
>
> When a slave is started or stopped and there is only one core, the spark 
> driver hangs. Example:
> {code}
> spark-1.6.1-bin-hadoop2.6/sbin/start-master.sh
> spark-1.6.1-bin-hadoop2.6/sbin/start-slave.sh $(hostname):7077
> spark-1.6.1-bin-hadoop2.6/bin/spark-shell --master spark://$(hostname):7077
> spark> sc.parallelize(1 to 300, 20).map(x => {Thread.sleep(100); 
> x*2}).collect()
> # While that is running, kill a slave
> spark-1.6.1-bin-hadoop2.6/sbin/stop-slave.sh
> {code}
> After 2 minutes, spark-shell spits out an error:
> {code}
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
>   at scala.util.Try$.apply(Try.scala:161)
>   at scala.util.Failure.recover(Try.scala:185)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> 

[jira] [Resolved] (SPARK-13906) Spark driver hangs when slave is started or stopped (org.apache.spark.rpc.RpcTimeoutException).

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13906.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11728
[https://github.com/apache/spark/pull/11728]

> Spark driver hangs when slave is started or stopped 
> (org.apache.spark.rpc.RpcTimeoutException).
> ---
>
> Key: SPARK-13906
> URL: https://issues.apache.org/jira/browse/SPARK-13906
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: Machine with one core (e.g. ec2 t2.small)
>Reporter: Yonathan Randolph
>Priority: Minor
> Fix For: 2.0.0
>
>
> When a slave is started or stopped and there is only one core, the spark 
> driver hangs. Example:
> {code}
> spark-1.6.1-bin-hadoop2.6/sbin/start-master.sh
> spark-1.6.1-bin-hadoop2.6/sbin/start-slave.sh $(hostname):7077
> spark-1.6.1-bin-hadoop2.6/bin/spark-shell --master spark://$(hostname):7077
> spark> sc.parallelize(1 to 300, 20).map(x => {Thread.sleep(100); 
> x*2}).collect()
> # While that is running, kill a slave
> spark-1.6.1-bin-hadoop2.6/sbin/stop-slave.sh
> {code}
> After 2 minutes, spark-shell spits out an error:
> {code}
> org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 
> seconds. This timeout is controlled by spark.rpc.askTimeout
>   at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>   at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
>   at scala.util.Try$.apply(Try.scala:161)
>   at scala.util.Failure.recover(Try.scala:185)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.complete(Promise.scala:55)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:241)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> 

[jira] [Created] (SPARK-13931) Resolve stage hanging up problem in a particular case

2016-03-16 Thread ZhengYaofeng (JIRA)
ZhengYaofeng created SPARK-13931:


 Summary: Resolve stage hanging up problem in a particular case
 Key: SPARK-13931
 URL: https://issues.apache.org/jira/browse/SPARK-13931
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.6.1, 1.6.0, 1.5.2, 1.4.1
Reporter: ZhengYaofeng


Suppose the following steps:
1. Enable speculation in the application. 
2. Run the app and suppose the last task of shuffleMapStage 1 finishes. To get 
the record straight: from the DAG's point of view this stage really finishes, 
and from the TaskSetManager's point of view the variable 'isZombie' is set to 
true, but runningTasksSet isn't empty because of speculation.
3. Suddenly, executor 3 is lost. The TaskScheduler, receiving this signal, 
invokes the executorLost functions of all of rootPool's taskSetManagers. The 
DAG, receiving this signal, removes all of this executor's outputLocs.
4. The TaskSetManager adds all of this executor's tasks to pendingTasks and 
tells the DAG they will be resubmitted (note: possibly not immediately).
5. The DAG starts to submit a new waiting stage, say shuffleMapStage 2, and 
finds that shuffleMapStage 1 is its missing parent because some outputLocs were 
removed due to the lost executor. The DAG then submits shuffleMapStage 1 again.
6. The DAG still receives Task 'Resubmitted' signals from the old 
taskSetManager and increases the number of pendingTasks of shuffleMapStage 1 
each time. However, the old taskSetManager won't submit new tasks because its 
variable 'isZombie' is set to true.
7. Finally, shuffleMapStage 1 never finishes in the DAG, together with all 
stages depending on it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13930) Apply fast serialization on collect limit

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13930:


Assignee: (was: Apache Spark)

> Apply fast serialization on collect limit
> -
>
> Key: SPARK-13930
> URL: https://issues.apache.org/jira/browse/SPARK-13930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Recently, fast serialization was introduced for collecting DataFrame/Dataset 
> results. The same technique can be applied to the collect-limit operator too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13930) Apply fast serialization on collect limit

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13930:


Assignee: Apache Spark

> Apply fast serialization on collect limit
> -
>
> Key: SPARK-13930
> URL: https://issues.apache.org/jira/browse/SPARK-13930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Recently, fast serialization was introduced for collecting DataFrame/Dataset 
> results. The same technique can be applied to the collect-limit operator too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13930) Apply fast serialization on collect limit

2016-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197078#comment-15197078
 ] 

Apache Spark commented on SPARK-13930:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11759

> Apply fast serialization on collect limit
> -
>
> Key: SPARK-13930
> URL: https://issues.apache.org/jira/browse/SPARK-13930
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Recently, fast serialization was introduced for collecting DataFrame/Dataset 
> results. The same technique can be applied to the collect-limit operator too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13930) Apply fast serialization on collect limit

2016-03-16 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13930:
---

 Summary: Apply fast serialization on collect limit
 Key: SPARK-13930
 URL: https://issues.apache.org/jira/browse/SPARK-13930
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


Recently, fast serialization was introduced for collecting DataFrame/Dataset 
results. The same technique can be applied to the collect-limit operator too.
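
For context, a minimal sketch of the operation being discussed, using the 
standard DataFrame API (the function wrapper is just for illustration):

{code}
import org.apache.spark.sql.{DataFrame, Row}

// A limited collect: only the first n rows come back to the driver. SPARK-13930
// proposes routing this through the same compressed, serialized collect path
// that was recently added for full DataFrame/Dataset collects.
def collectLimited(df: DataFrame, n: Int): Array[Row] = df.limit(n).collect()
{code}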



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12653:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12653:
--
Assignee: Dongjoon Hyun

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12855) Remove parser pluggability

2016-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197064#comment-15197064
 ] 

Apache Spark commented on SPARK-12855:
--

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11758

> Remove parser pluggability
> --
>
> Key: SPARK-12855
> URL: https://issues.apache.org/jira/browse/SPARK-12855
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This pull request removes the public developer parser API for external 
> parsers. Given that everything a parser depends on (e.g. logical plans and 
> expressions) is internal and not stable, external parsers will break with 
> every release of Spark. It is a bad idea to create the illusion that Spark 
> actually supports pluggable parsers. In addition, it also reduces the 
> incentive for 3rd-party projects to contribute parser improvements back to 
> Spark.
> The number of applications using this feature is small (as far as I know it 
> came down from two to one as of Jan 2016, and will be 0 once we have better 
> ANSI SQL support).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12653.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11744
[https://github.com/apache/spark/pull/11744]

> Re-enable test "SPARK-8489: MissingRequirementError during reflection"
> --
>
> Key: SPARK-12653
> URL: https://issues.apache.org/jira/browse/SPARK-12653
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
> Fix For: 2.0.0
>
>
> This test case was disabled in 
> https://github.com/apache/spark/pull/10569#discussion-diff-48813840
> I think we need to rebuild the jar because it was compiled against an old 
> version of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1359) SGD implementation is not efficient

2016-03-16 Thread Mohamed Baddar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197062#comment-15197062
 ] 

Mohamed Baddar commented on SPARK-1359:
---

[~mengxr] If this issue is still of interest and nobody is working on it, I 
can start on an implementation.

> SGD implementation is not efficient
> ---
>
> Key: SPARK-1359
> URL: https://issues.apache.org/jira/browse/SPARK-1359
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Xiangrui Meng
>
> The SGD implementation samples a mini-batch to compute the stochastic 
> gradient. This is not efficient because examples are provided via an iterator 
> interface. We have to scan all of them to obtain a sample.
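
A minimal sketch of the inefficiency described above, assuming a plain RDD of 
labeled examples (the function and names are illustrative, not MLlib's actual 
implementation):

{code}
import org.apache.spark.rdd.RDD

def naiveMiniBatchSgd(data: RDD[(Double, Array[Double])], numIterations: Int): Unit = {
  val miniBatchFraction = 0.01
  for (i <- 1 to numIterations) {
    // sample() still walks every partition's iterator end to end just to
    // draw this small fraction, which is the inefficiency being described.
    val miniBatch = data.sample(withReplacement = false, miniBatchFraction, seed = 42L + i)
    // compute the stochastic gradient over miniBatch and update the weights ...
    miniBatch.count()  // stands in for the gradient computation
  }
}
{code}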



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13914) Add functionality to back up spark event logs

2016-03-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13914.
---
Resolution: Won't Fix

> Add functionality to back up spark event logs
> -
>
> Key: SPARK-13914
> URL: https://issues.apache.org/jira/browse/SPARK-13914
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0, 1.6.2, 2.0.0
>Reporter: Parag Chaudhari
>
> Spark event logs are usually stored in HDFS when running Spark on YARN. In a 
> cloud environment, these HDFS files are often stored on the disks of 
> ephemeral instances that go away once the instances are terminated. 
> Users may want to persist the event logs as events happen, for issue 
> investigation and performance analysis before and after the cluster is 
> terminated. The backup path can be managed by Spark users based on their 
> needs. For example, some users may copy the event logs to a cloud storage 
> service directly and keep them there forever, while others may want to store 
> the event logs on local disks and back them up to a cloud storage service 
> from time to time. Still other users will not want the feature at all, so it 
> should be off by default; users enable it when and only when they need it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13785) Deprecate model field in ML model summary classes

2016-03-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196971#comment-15196971
 ] 

Yanbo Liang commented on SPARK-13785:
-

[~josephkb] Can I work on this? 
I vote to make the model field private from 2.0: all summary classes are 
experimental, users can get the model directly, and we do not have a Python API 
for the summary classes currently. 
Meanwhile, I think we also should not expose the names of the columns, such as 
featureCol and labelCol, in the public API. We can remove other unnecessary 
fields from the model summary classes as well.

> Deprecate model field in ML model summary classes
> -
>
> Key: SPARK-13785
> URL: https://issues.apache.org/jira/browse/SPARK-13785
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> ML model summary classes (e.g., LinearRegressionSummary) currently expose a 
> field "model" containing the parent model.  It's weird to have this circular 
> reference, and I don't see a good reason why the summary should expose it 
> (unless I'm forgetting some decision we made before...).
> I'd propose to deprecate that field in 2.0 and to remove it in 2.1.
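
A minimal sketch of the circular reference being discussed, assuming the 
spark.ml LinearRegression API (the {{training}} DataFrame and the wrapper 
function are hypothetical):

{code}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.DataFrame

def fitAndInspect(training: DataFrame): LinearRegressionModel = {
  val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("label")
  val model = lr.fit(training)
  val summary = model.summary   // training summary attached to the fitted model
  // Per the issue description, the summary in turn exposes its parent model,
  // i.e. summary.model refers back to `model` - hence the circular reference.
  model
}
{code}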



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13929) Use Scala reflection for UDFs

2016-03-16 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-13929:
-

 Summary: Use Scala reflection for UDFs
 Key: SPARK-13929
 URL: https://issues.apache.org/jira/browse/SPARK-13929
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Jakob Odersky
Priority: Minor


{{ScalaReflection}} uses native Java reflection for User Defined Types which 
would fail if such types are not plain Scala classes that map 1:1 to Java.

Consider the following extract (from here 
https://github.com/apache/spark/blob/92024797a4fad594b5314f3f3be5c6be2434de8a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L376
 ):
{code}
case t if Utils.classIsLoadable(className) &&
Utils.classForName(className).isAnnotationPresent(classOf[SQLUserDefinedType]) 
=>

val udt = 
Utils.classForName(className).getAnnotation(classOf[SQLUserDefinedType]).udt().newInstance()
//...
{code}

If {{t}}'s runtime class is actually synthetic (something that doesn't exist in 
Java and hence uses a dollar sign internally), such as nested classes or 
package objects, the above code will fail.

> Currently there are no known use cases of synthetic user-defined types (hence 
> the minor priority); however, it would be best practice to remove plain Java 
> reflection and rely on Scala reflection instead.
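
A hedged sketch of the alternative the issue proposes, using only the standard 
scala-reflect runtime mirror; this is illustrative, not the actual 
ScalaReflection change, and the annotation checked here is just a stand-in:

{code}
import scala.reflect.runtime.{universe => ru}

object ReflectSketch {
  val mirror = ru.runtimeMirror(getClass.getClassLoader)
  val tpe = ru.typeOf[Option[Int]]   // stands in for the type `t` in the extract above

  // Resolving the JVM class through the mirror handles synthetic names correctly,
  val runtimeClass: Class[_] = mirror.runtimeClass(tpe.typeSymbol.asClass)
  // whereas Class.forName(tpe.typeSymbol.fullName) reconstructs the Scala-level
  // name (no dollar signs) and can fail for nested classes or package objects.
  val isDeprecated: Boolean = runtimeClass.isAnnotationPresent(classOf[Deprecated])
}
{code}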



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2016-03-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10788:
--
Target Version/s: 2.0.0

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).
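
A tiny numeric illustration of the pseudomath above, with made-up label counts 
(nothing here comes from the actual RandomForest code):

{code}
// Label histograms (counts per class) collected for the left-hand subsets:
val statsA   = Array(5.0, 2.0)    // subset {A}
val statsABC = Array(12.0, 7.0)   // the whole node {A,B,C}

// The right-hand side of the split "A vs. B,C" is derivable, so it never
// needs to be communicated: stats(B,C) = stats(A,B,C) - stats(A)
val statsBC = statsABC.zip(statsA).map { case (total, a) => total - a }  // Array(7.0, 5.0)
{code}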



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2016-03-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10788:
--
Shepherd: Joseph K. Bradley

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2016-03-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10788:
--
Assignee: Seth Hendrickson

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13118) Support for classes defined in package objects

2016-03-16 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196928#comment-15196928
 ] 

Jakob Odersky commented on SPARK-13118:
---

Update: there was actually an issue with inner classes (or package objects, or 
any other synthetic class containing a dollar sign); however, it only occurs 
when they are wrapped in Option types.

> Support for classes defined in package objects
> --
>
> Key: SPARK-13118
> URL: https://issues.apache.org/jira/browse/SPARK-13118
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>
> When you define a class inside a package object, the name ends up being 
> something like {{org.mycompany.project.package$MyClass}}. However, when we 
> reflect on this we try to load {{org.mycompany.project.MyClass}}.
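
A minimal sketch of the setup being described (the package and class names are 
made up, and the runtime-name claim is the one from the issue description):

{code}
package org.mycompany

package object project {
  class MyClass
}

// Per the issue description, the runtime (JVM) name of MyClass ends up being
//   "org.mycompany.project.package$MyClass"
// while reflection code that reconstructs the Scala-level name looks for
//   "org.mycompany.project.MyClass"
// and therefore fails to load the class.
{code}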



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13927) Add row/column iterator to local matrices

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13927:


Assignee: Apache Spark  (was: Xiangrui Meng)

> Add row/column iterator to local matrices
> -
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13927) Add row/column iterator to local matrices

2016-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196900#comment-15196900
 ] 

Apache Spark commented on SPARK-13927:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/11757

> Add row/column iterator to local matrices
> -
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-13855.
-
   Resolution: Fixed
Fix Version/s: 1.6.1

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
> Fix For: 1.6.1
>
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked the s3 bucket spark-related-packages and noticed that no Spark 1.6.1 
> package is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196901#comment-15196901
 ] 

Patrick Wendell commented on SPARK-13855:
-

I've uploaded the artifacts, thanks.

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
> Fix For: 1.6.1
>
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked the s3 bucket spark-related-packages and noticed that no Spark 1.6.1 
> package is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13927) Add row/column iterator to local matrices

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13927:


Assignee: Xiangrui Meng  (was: Apache Spark)

> Add row/column iterator to local matrices
> -
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13855) Spark 1.6.1 artifacts not found in S3 bucket / direct download

2016-03-16 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reassigned SPARK-13855:
---

Assignee: Patrick Wendell  (was: Michael Armbrust)

> Spark 1.6.1 artifacts not found in S3 bucket / direct download
> --
>
> Key: SPARK-13855
> URL: https://issues.apache.org/jira/browse/SPARK-13855
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: production
>Reporter: Sandesh Deshmane
>Assignee: Patrick Wendell
>
> Getting below error while deploying spark on EC2 with version 1.6.1
> [timing] scala init:  00h 00m 12s
> Initializing spark
> --2016-03-14 07:05:30--  
> http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.50.12
> Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.50.12|:80... 
> connected.
> HTTP request sent, awaiting response... 404 Not Found
> 2016-03-14 07:05:30 ERROR 404: Not Found.
> ERROR: Unknown Spark version
> spark/init.sh: line 137: return: -1: invalid option
> return: usage: return [n]
> Unpacking Spark
> tar (child): spark-*.tgz: Cannot open: No such file or directory
> tar (child): Error is not recoverable: exiting now
> tar: Child returned status 2
> tar: Error is not recoverable: exiting now
> rm: cannot remove `spark-*.tgz': No such file or directory
> mv: missing destination file operand after `spark'
> Try `mv --help' for more information.
> Checked the s3 bucket spark-related-packages and noticed that no Spark 1.6.1 
> package is present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-03-16 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-13928:
---

 Summary: Move org.apache.spark.Logging into 
org.apache.spark.internal.Logging
 Key: SPARK-13928
 URL: https://issues.apache.org/jira/browse/SPARK-13928
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Reynold Xin


Logging was made private in Spark 2.0. If we move it, then users would be able 
to create a Logging trait themselves to avoid changing their own code. 
Alternatively, we can also provide a compatibility package that adds logging.
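
A minimal sketch of the user-side workaround mentioned above: a self-defined 
logging trait so application code does not have to change when Spark's own 
trait moves (the names are illustrative, not a Spark-provided shim; assumes 
slf4j is on the classpath):

{code}
import org.slf4j.{Logger, LoggerFactory}

trait MyLogging {
  private lazy val log: Logger = LoggerFactory.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
}

class MyJob extends MyLogging {
  def run(): Unit = logInfo("job started")
}
{code}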




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13927) Add row/column iterator to local matrix

2016-03-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13927:
--
Summary: Add row/column iterator to local matrix  (was: Add row/column 
iterator to matrix)

> Add row/column iterator to local matrix
> ---
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13927) Add row/column iterator to local matrices

2016-03-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13927:
--
Summary: Add row/column iterator to local matrices  (was: Add row/column 
iterator to local matrix)

> Add row/column iterator to local matrices
> -
>
> Key: SPARK-13927
> URL: https://issues.apache.org/jira/browse/SPARK-13927
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> Add row/column iterator to local matrices to simplify tasks like BlockMatrix 
> => RowMatrix conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13927) Add row/column iterator to matrix

2016-03-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-13927:
-

 Summary: Add row/column iterator to matrix
 Key: SPARK-13927
 URL: https://issues.apache.org/jira/browse/SPARK-13927
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor


Add row/column iterator to local matrices to simplify tasks like BlockMatrix => 
RowMatrix conversion.
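
A hedged usage sketch of what this might look like on an existing local matrix; 
the iterator method names are assumptions, not a committed API:

{code}
import org.apache.spark.mllib.linalg.DenseMatrix

val m = DenseMatrix.eye(3)

// Proposed shape of the API (illustrative): walk the matrix row by row or
// column by column without first converting it to another representation.
// m.rowIter.foreach(row => println(row))
// m.colIter.foreach(col => println(col))
{code}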



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13764) Parse modes in JSON data source

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13764:


Assignee: (was: Apache Spark)

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, JSON data source just fails to read if some JSON documents are 
> malformed.
> Therefore, if there are two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> This will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data 
> themselves.
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway. 
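
A hedged sketch of how the requested option might look from the user side, 
mirroring the spark-csv parse modes (PERMISSIVE / DROPMALFORMED / FAILFAST). 
The {{mode}} option for JSON is the proposal here, not an API that existed when 
this issue was filed; the schema below just mirrors the example documents, and 
{{sqlContext}} is assumed to be the one provided by spark-shell:

{code}
import org.apache.spark.sql.types._

val userType = StructType(Seq(StructField("id", LongType, nullable = true)))
val requestType = StructType(Seq(StructField("user", userType, nullable = true)))
val customSchema = StructType(Seq(StructField("request", requestType, nullable = true)))

// Proposed: a "mode" option like spark-csv's, so malformed documents can be
// dropped (or reported) instead of failing the whole job.
val df = sqlContext.read
  .schema(customSchema)
  .option("mode", "DROPMALFORMED")
  .json("/path/to/documents.json")
{code}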



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13764) Parse modes in JSON data source

2016-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196887#comment-15196887
 ] 

Apache Spark commented on SPARK-13764:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11756

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, JSON data source just fails to read if some JSON documents are 
> malformed.
> Therefore, if there are two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> This will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data 
> themselves.
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13764) Parse modes in JSON data source

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13764:


Assignee: Apache Spark

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, JSON data source just fails to read if some JSON documents are 
> malformed.
> Therefore, if there are two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> This will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data 
> themselves.
> This happens only when a custom schema is set. When the schema is inferred, 
> the type is inferred as {{StringType}}, which reads the data successfully 
> anyway. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:45 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use model to batch-predict recommendations, and export the resulting 
predictions DF - so perhaps not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions 
for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2 - requires to (re)specify the user / item input col in the input DF
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).
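> For reference, a rough sketch of how those existing {{mllib}} methods are invoked 
> today (the training parameters and the user/item ids below are placeholders):
> {code}
> import org.apache.spark.rdd.RDD
> import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
>
> def existingMllibCalls(ratings: RDD[Rating]): Unit = {
>   // rank = 10, iterations = 10 are placeholder training parameters
>   val model: MatrixFactorizationModel = ALS.train(ratings, 10, 10)
>
>   // top K for a single user / item
>   val itemsForUser: Array[Rating] = model.recommendProducts(1, 10)
>   val usersForItem: Array[Rating] = model.recommendUsers(1, 10)
>
>   // top K across all users / items
>   val itemsForAllUsers: RDD[(Int, Array[Rating])] = model.recommendProductsForUsers(10)
>   val usersForAllItems: RDD[(Int, Array[Rating])] = model.recommendUsersForProducts(10)
> }
> {code}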






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195702#comment-15195702
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM:
-

Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setUserTopKCol("userTopK").transform(users)
{code}


was (Author: mlnick):
Also, what's nice in the ML API is that SPARK-10802 is essentially taken care 
of by passing in a DataFrame with the users of interest, e.g.
{code}
val users = df.filter(df("age") > 21)
val topK = model.setK(10).setTopKCol("userId").transform(users)
{code}

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:42 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2 - requires to (re)specify the user / item input col in the input DF
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:41 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userTopK").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemTopK").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, 10)
val topKUsersForItems = model.recommendUsers(df, 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding methods such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Comment Edited] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-03-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15195696#comment-15195696
 ] 

Nick Pentreath edited comment on SPARK-13857 at 3/16/16 6:38 AM:
-

There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{recommendItems}} and {{recommendUsers}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and export the resulting 
predictions DF - so perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1 - requires 3 extra params
val topKItemsForUsers = model.setK(10).setUserTopKCol("userId").transform(df)
val topKUsersForItems = model.setK(10).setItemTopKCol("itemId").transform(df)

// Option 2
val topKItemsForUsers = model.recommendItems(df, "userId", 10)
val topKUsersForItems = model.recommendUsers(df, "itemId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit into the 
{{Transformer}} API, even though it's a little more clunky.


was (Author: mlnick):
There are two broad options for adding this, in terms of ML API:

# Extending {{transform}} to work with additional param(s) to specify whether 
to recommend top-k. 
# Adding a method such as {{predictTopK}}.

I've seen some examples of #2, e.g. in {{LDAModel.describeTopics}}. However, 
this seems to fall more naturally into #1, so that it can be part of a 
Pipeline. Having said that, this is likely to be the final stage of a pipeline 
- use the model to batch-predict recommendations and save the resulting DF - so 
perhaps it is not that important.

e.g.
{code}
val model = ALS.fit(df)
// model has userCol and itemCol set, so calling transform makes predictions
// for each user, item combination
val predictions = model.transform(df)

// Option 1
val topKItemsForUsers = model.setK(10).setTopKCol("userId").transform(df)

// Option 2
val topKItemsForUsers = model.predictTopK("userId", 10)
{code}

[~josephkb] [~mengxr] thoughts? I guess I lean toward #1 to fit more neatly 
into the {{Transformer}} API, even though it's a little more clunky.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).






[jira] [Resolved] (SPARK-13899) Produce InternalRow instead of external Row

2016-03-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13899.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> Produce InternalRow instead of external Row
> ---
>
> Key: SPARK-13899
> URL: https://issues.apache.org/jira/browse/SPARK-13899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> Currently CSVRelation.parseCsv produces an external {{Row}}.
> As noted in a TODO about avoiding the extra encoding, it would be great if this 
> produced {{InternalRow}} instead of an external {{Row}}.






[jira] [Assigned] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13926:


Assignee: Apache Spark  (was: Josh Rosen)

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Because ClassTags are available when constructing ShuffledRDD we can use  
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.






[jira] [Assigned] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13926:


Assignee: Josh Rosen  (was: Apache Spark)

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Because ClassTags are available when constructing ShuffledRDD we can use  
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.






[jira] [Commented] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196857#comment-15196857
 ] 

Apache Spark commented on SPARK-13926:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11755

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Because ClassTags are available when constructing ShuffledRDD we can use  
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.






[jira] [Resolved] (SPARK-13920) MIMA checks should apply to @Experimental and @DeveloperAPI APIs

2016-03-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13920.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> MIMA checks should apply to @Experimental and @DeveloperAPI APIs
> 
>
> Key: SPARK-13920
> URL: https://issues.apache.org/jira/browse/SPARK-13920
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> Our MIMA binary compatibility checks currently ignore APIs which are marked 
> as {{@Experimental}} or {{@DeveloperApi}}, but I don't think this makes 
> sense. Even if those annotations _reserve_ the right to break binary 
> compatibility, we should still avoid compatibility breaks whenever possible 
> and should be informed explicitly when compatibility does break.
> As a result, we should update GenerateMIMAIgnore to stop ignoring classes and 
> methods which have those annotations. To remove the ignores, remove the 
> checks from 
> https://github.com/apache/spark/blob/643649dcbfabc5d6952c2ecfb98286324c887665/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala#L43
> After removing the ignores, update {{project/MimaExcludes.scala}} to add 
> exclusions for binary compatibility breaks introduced in 2.0.
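> As a rough illustration, an exclusion entry in {{project/MimaExcludes.scala}} looks 
> something like the following (the excluded member here is a made-up placeholder, 
> not an actual break):
> {code}
> import com.typesafe.tools.mima.core.{MissingMethodProblem, ProblemFilters}
>
> // Hypothetical exclusion for a binary-incompatible change in a formerly
> // @Experimental class; real entries name the actual class and member.
> val hypotheticalExcludes = Seq(
>   ProblemFilters.exclude[MissingMethodProblem](
>     "org.apache.spark.SomeExperimentalClass.someRemovedMethod")
> )
> {code}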






[jira] [Created] (SPARK-13926) Automatically use Kryo serializer when it is known to be safe

2016-03-16 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-13926:
--

 Summary: Automatically use Kryo serializer when it is known to be 
safe
 Key: SPARK-13926
 URL: https://issues.apache.org/jira/browse/SPARK-13926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Reporter: Josh Rosen
Assignee: Josh Rosen


Because ClassTags are available when constructing ShuffledRDD, we can use them 
to automatically use Kryo for shuffle serialization when the RDD's types are 
guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
combiner types are primitives, arrays of primitives, or strings). This is 
likely to result in a large performance gain for many RDD API workloads.
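
A rough sketch of the kind of check this enables (names and structure here are 
illustrative, not the actual implementation):
{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkConf
import org.apache.spark.serializer.{JavaSerializer, KryoSerializer, Serializer}

object KryoShuffleHeuristic {
  // A type is "known safe" for Kryo if it is a primitive, an array of
  // primitives, or a String.
  def kryoSafe(ct: ClassTag[_]): Boolean = {
    val cls = ct.runtimeClass
    cls.isPrimitive ||
      (cls.isArray && cls.getComponentType.isPrimitive) ||
      cls == classOf[String]
  }

  // Pick Kryo when the key and value ClassTags captured at ShuffledRDD
  // construction time are all known to be Kryo-compatible; otherwise fall
  // back to the plain Java serializer.
  def chooseShuffleSerializer[K: ClassTag, V: ClassTag](conf: SparkConf): Serializer = {
    if (kryoSafe(implicitly[ClassTag[K]]) && kryoSafe(implicitly[ClassTag[V]])) {
      new KryoSerializer(conf)
    } else {
      new JavaSerializer(conf)
    }
  }
}
{code}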






[jira] [Updated] (SPARK-13926) Automatically use Kryo serializer when shuffling RDDs with simple types

2016-03-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-13926:
---
Summary: Automatically use Kryo serializer when shuffling RDDs with simple 
types  (was: Automatically use Kryo serializer when it is known to be safe)

> Automatically use Kryo serializer when shuffling RDDs with simple types
> ---
>
> Key: SPARK-13926
> URL: https://issues.apache.org/jira/browse/SPARK-13926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Because ClassTags are available when constructing ShuffledRDD we can use  
> them to automatically use Kryo for shuffle serialization when the RDD's types 
> are guaranteed to be compatible with Kryo (e.g. RDDs whose key, value, and/or 
> combiner types are primitives, arrays of primitives, or strings). This is 
> likely to result in a large performance gain for many RDD API workloads.


