[jira] [Closed] (SPARK-2351) Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bert Greevenbosch closed SPARK-2351. Resolution: Duplicate Duplicate with SPARK-2352. > Add Artificial Neural Network (ANN) to Spark > > > Key: SPARK-2351 > URL: https://issues.apache.org/jira/browse/SPARK-2351 > Project: Spark > Issue Type: New Feature > Components: MLlib > Environment: MLLIB code >Reporter: Bert Greevenbosch > > It would be good if the Machine Learning Library contained Artificial Neural > Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2352) Add Artificial Neural Network (ANN) to Spark
Bert Greevenbosch created SPARK-2352: Summary: Add Artificial Neural Network (ANN) to Spark Key: SPARK-2352 URL: https://issues.apache.org/jira/browse/SPARK-2352 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2351) Add Artificial Neural Network (ANN) to Spark
Bert Greevenbosch created SPARK-2351: Summary: Add Artificial Neural Network (ANN) to Spark Key: SPARK-2351 URL: https://issues.apache.org/jira/browse/SPARK-2351 Project: Spark Issue Type: New Feature Components: MLlib Environment: MLLIB code Reporter: Bert Greevenbosch It would be good if the Machine Learning Library contained Artificial Neural Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982 ] Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:52 AM: --- [~marmbrus], I fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. was (Author: yijieshen): [~marmbrus] I fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric add/Minus/Multipy. > Just as the comment suggest : {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in code, function f was casted to function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I thought it as a typo and the correct should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982 ] Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:51 AM: --- [~marmbrus] Fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. was (Author: yijieshen): Fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric add/Minus/Multipy. > Just as the comment suggest : {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in code, function f was casted to function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I thought it as a typo and the correct should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982 ] Yijie Shen edited comment on SPARK-2342 at 7/3/14 1:52 AM: --- [~marmbrus] I fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. was (Author: yijieshen): [~marmbrus] Fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric add/Minus/Multipy. > Just as the comment suggest : {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in code, function f was casted to function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I thought it as a typo and the correct should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050982#comment-14050982 ] Yijie Shen commented on SPARK-2342: --- Fix the typo in PR: https://github.com/apache/spark/pull/1283. Please check it, thanks. > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric add/Minus/Multipy. > Just as the comment suggest : {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in code, function f was casted to function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I thought it as a typo and the correct should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
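The signature fix reported above can be illustrated with a minimal, self-contained sketch. This is plain Scala, not Spark's actual `n2` helper: the point is that a generic numeric combinator must return the operand type `T`; declaring the function's result as `Int` (the reported typo) would lose fractional results for `Double` inputs.

```scala
// Minimal sketch of the corrected shape (illustrative, not Spark's code):
// the combining function returns the numeric operand type T, not Int.
def n2[T](x: T, y: T)(f: (Numeric[T], T, T) => T)(implicit num: Numeric[T]): T =
  f(num, x, y)

val total = n2(1.5, 2.25)((num, a, b) => num.plus(a, b))
// total == 3.75; an Int-returning signature could not represent this.
val product = n2(3, 4)((num, a, b) => num.times(a, b))
// product == 12
```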
[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050952#comment-14050952 ] Rui Li commented on SPARK-2277: --- PR created at: https://github.com/apache/spark/pull/1212 > Make TaskScheduler track whether there's host on a rack > --- > > Key: SPARK-2277 > URL: https://issues.apache.org/jira/browse/SPARK-2277 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > > When TaskSetManager adds a pending task, it checks whether the tasks's > preferred location is available. Regarding RACK_LOCAL task, we consider the > preferred rack available if such a rack is defined for the preferred host. > This is incorrect as there may be no alive hosts on that rack at all. > Therefore, TaskScheduler should track the hosts on each rack, and provides an > API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050951#comment-14050951 ] Rui Li commented on SPARK-2277: --- Suppose task1 prefers node1 but node1 is not available at the moment. However, we know node1 is on rack1, which makes task1 prefer rack1 for RACK_LOCAL locality. The problem is, we don't know if there's an alive host on rack1, so we cannot check the availability of this preference. Please let me know if I misunderstood anything :) > Make TaskScheduler track whether there's host on a rack > --- > > Key: SPARK-2277 > URL: https://issues.apache.org/jira/browse/SPARK-2277 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > > When TaskSetManager adds a pending task, it checks whether the tasks's > preferred location is available. Regarding RACK_LOCAL task, we consider the > preferred rack available if such a rack is defined for the preferred host. > This is incorrect as there may be no alive hosts on that rack at all. > Therefore, TaskScheduler should track the hosts on each rack, and provides an > API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
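The bookkeeping being proposed above can be sketched in a few lines of plain Scala. This is a hypothetical illustration with invented names, not Spark's actual scheduler API: the scheduler keeps a rack-to-alive-hosts map, and a TaskSetManager asks it whether any host is alive on a rack before treating a RACK_LOCAL preference as satisfiable.

```scala
import scala.collection.mutable

// Hypothetical sketch (illustrative names, not Spark's API) of tracking
// alive hosts per rack so a RACK_LOCAL preference can be validated.
class RackTracker {
  // rack name -> hosts currently alive on that rack
  private val hostsByRack = mutable.HashMap.empty[String, mutable.HashSet[String]]

  def hostAdded(host: String, rack: String): Unit =
    hostsByRack.getOrElseUpdate(rack, mutable.HashSet.empty) += host

  def hostLost(host: String, rack: String): Unit =
    hostsByRack.get(rack).foreach { hosts =>
      hosts -= host
      if (hosts.isEmpty) hostsByRack -= rack // drop racks with no alive host
    }

  // The check the issue says is missing: is any host alive on this rack?
  def hasAliveHostOnRack(rack: String): Boolean = hostsByRack.contains(rack)
}
```

Under this sketch, task1's preference for rack1 would only count as available while some host on rack1 is actually alive, instead of merely because rack1 is defined for node1.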
[jira] [Commented] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050894#comment-14050894 ] Andrew Or commented on SPARK-2350: -- This is the root cause of SPARK-2154 > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > Here is the culprit from Master.scala (L487 as of the creation of this JIRA, > commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c). > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050891#comment-14050891 ] Andrew Or commented on SPARK-2350: -- In general, if the Master dies because of an exception, it automatically restarts and the exception message is hidden in the logs. It took a while for [~ilikerps] and me to find the exception as we scrolled through the logs. > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > Here is the culprit from Master.scala (L487 as of the creation of this JIRA, > commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c). > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050891#comment-14050891 ] Andrew Or edited comment on SPARK-2350 at 7/3/14 12:07 AM: --- In general, if the Master dies because of an exception, it automatically restarts and the exception message is hidden in the logs. In the meantime, the symptoms are not indicative of a Master having thrown an exception and restarted. It took a while for [~ilikerps] and me to find the exception as we were scrolling through the logs. was (Author: andrewor): In general, if Master dies because of an exception, it automatically restarts and the exception message is hidden in the logs. It took a while for [~ilikerps] and I to find the exception as we are scrolling through the logs. > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > Here is the culprit from Master.scala (L487 as of the creation of this JIRA, > commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c). > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2350: - Description: ... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. Here is the culprit from Master.scala (L487 as of the creation of this JIRA, commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c). {code} for (driver <- waitingDrivers) { if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) { launchDriver(worker, driver) waitingDrivers -= driver } } {code} was: ... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. {code} for (driver <- waitingDrivers) { if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) { launchDriver(worker, driver) waitingDrivers -= driver } } {code} > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > Here is the culprit from Master.scala (L487 as of the creation of this JIRA, > commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c). > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2350: - Description: ... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. {code} for (driver <- waitingDrivers) { if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) { launchDriver(worker, driver) waitingDrivers -= driver } } {code} was:... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2350) Master throws NPE
Andrew Or created SPARK-2350: Summary: Master throws NPE Key: SPARK-2350 URL: https://issues.apache.org/jira/browse/SPARK-2350 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Fix For: 1.1.0 ... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2350: - Description: ... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. {code} for (driver <- waitingDrivers) { if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) { launchDriver(worker, driver) waitingDrivers -= driver } } {code} was: ... if we launch a driver and there are more waiting drivers to be launched. This is because we remove from a list while iterating through this. {code} for (driver <- waitingDrivers) { if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) { launchDriver(worker, driver) waitingDrivers -= driver } } {code} > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
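The bug class in the `waitingDrivers` snippet above — removing elements from a mutable collection while a `for` comprehension is iterating it — can be reproduced and avoided with plain Scala collections. The fix shown here (iterating over a `.toList` snapshot) is one conventional safe pattern, offered as an illustration rather than as Spark's actual patch:

```scala
import scala.collection.mutable.ArrayBuffer

// Illustration of the hazard described above: mutating a buffer that is
// being iterated. Taking a snapshot with .toList makes it safe to remove
// elements from the original buffer inside the loop body.
val waitingDrivers = ArrayBuffer("driver-1", "driver-2", "driver-3")
val launched = ArrayBuffer.empty[String]

for (driver <- waitingDrivers.toList) { // iterate the snapshot, not the buffer
  launched += driver        // stand-in for launchDriver(worker, driver)
  waitingDrivers -= driver  // safe: does not disturb the snapshot
}
// Every waiting driver is visited exactly once and the buffer ends up empty.
```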
[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050886#comment-14050886 ] Mridul Muralidharan commented on SPARK-2277: I am not sure I follow this requirement. For preferred locations, we populate their corresponding racks (if available) as the preferred rack. For available executor hosts, we look up the rack they belong to, and then see if that rack is preferred or not. This, of course, assumes a host is only on a single rack. What exactly is the behavior you are expecting from the scheduler? > Make TaskScheduler track whether there's host on a rack > --- > > Key: SPARK-2277 > URL: https://issues.apache.org/jira/browse/SPARK-2277 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > > When TaskSetManager adds a pending task, it checks whether the tasks's > preferred location is available. Regarding RACK_LOCAL task, we consider the > preferred rack available if such a rack is defined for the preferred host. > This is incorrect as there may be no alive hosts on that rack at all. > Therefore, TaskScheduler should track the hosts on each rack, and provides an > API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap
Andrew Or created SPARK-2349: Summary: Fix NPE in ExternalAppendOnlyMap Key: SPARK-2349 URL: https://issues.apache.org/jira/browse/SPARK-2349 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or It throws an NPE on null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
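A self-contained sketch of the failure mode named above (illustrative, not the `ExternalAppendOnlyMap` internals): any key-hashing path that dereferences the key throws a NullPointerException on `null`, so hash-based structures must special-case null keys explicitly.

```scala
// A naive hashing path NPEs on null keys:
def naiveHash(key: AnyRef): Int = key.hashCode()

// One conventional guard: map null to a fixed hash code instead.
def nullSafeHash(key: AnyRef): Int = if (key == null) 0 else key.hashCode()

val naiveThrew =
  try { naiveHash(null); false }
  catch { case _: NullPointerException => true }
```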
[jira] [Commented] (SPARK-1614) Move Mesos protobufs out of TaskState
[ https://issues.apache.org/jira/browse/SPARK-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050804#comment-14050804 ] Martin Zapletal commented on SPARK-1614: I am considering moving the protobufs to a new object - something like object org.apache.spark.MesosTaskState. Is that an acceptable solution with regard to the requirements (avoiding the conflicts)? If not, could you please suggest the best place for it? > Move Mesos protobufs out of TaskState > - > > Key: SPARK-1614 > URL: https://issues.apache.org/jira/browse/SPARK-1614 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 0.9.1 >Reporter: Shivaram Venkataraman >Priority: Minor > Labels: Starter > > To isolate usage of Mesos protobufs it would be good to move them out of > TaskState into either a new class (MesosUtils ?) or > CoarseGrainedMesos{Executor, Backend}. > This would allow applications to build Spark to run without including > protobuf from Mesos in their shaded jars. This is one way to avoid protobuf > conflicts between Mesos and Hadoop > (https://issues.apache.org/jira/browse/MESOS-1203) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Albul updated SPARK-2346: --- Summary: Error parsing table names that starts with numbers (was: Error parsing table names that starts from numbers) > Error parsing table names that starts with numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul > Labels: Parser, SQL > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start with numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is the exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When I change 123_table to table_123, the problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
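The parse failure is consistent with a conventional identifier rule that requires a leading letter or underscore. A minimal regex sketch in plain Scala (not Spark's actual grammar) shows why `123_table` is rejected while `table_123` is accepted:

```scala
// Typical identifier rule: first character must be a letter or underscore,
// the rest may be letters, digits, or underscores.
val ident = "[A-Za-z_][A-Za-z0-9_]*".r

def isIdentifier(s: String): Boolean = ident.pattern.matcher(s).matches()

isIdentifier("123_table") // rejected: starts with a digit
isIdentifier("table_123") // accepted
```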
[jira] [Commented] (SPARK-2348) In Windows having an environment variable named 'classpath' gives error
[ https://issues.apache.org/jira/browse/SPARK-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050757#comment-14050757 ] Chirag Todarka commented on SPARK-2348: --- [~pwendell] [~cheffpj] Hi Patrick/Pat, I am new to the project and want to contribute to this. I hope this will be a great starting point for me, so please assign it to me if possible. Regards, Chirag Todarka > In Windows having an environment variable named 'classpath' gives error > --- > > Key: SPARK-2348 > URL: https://issues.apache.org/jira/browse/SPARK-2348 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: Windows 7 Enterprise >Reporter: Chirag Todarka > > Operating System:: Windows 7 Enterprise > If an environment variable named 'classpath' exists, then starting > 'spark-shell' gives the error below:: > \spark\bin>spark-shell > Failed to initialize compiler: object scala.runtime in compiler mirror not found. > ** Note that as of 2.8 scala does not assume use of the java classpath. > ** For the old behavior pass -usejavacp to scala, or if using a Settings > ** object programatically, settings.usejavacp.value = true. > 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code. > Failed to initialize compiler: object scala.runtime in compiler mirror not found. > ** Note that as of 2.8 scala does not assume use of the java classpath. > ** For the old behavior pass -usejavacp to scala, or if using a Settings > ** object programatically, settings.usejavacp.value = true. > Exception in thread "main" java.lang.AssertionError: assertion failed: null > at scala.Predef$.assert(Predef.scala:179) > at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202) > at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929) > at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) > at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) > at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) > at java.lang.reflect.Method.invoke(Unknown Source) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henry Saputra updated SPARK-1305: - Comment: was deleted (was: Never mind, Found it, it was when Spark in incubtor) > Support persisting RDD's directly to Tachyon > > > Key: SPARK-1305 > URL: https://issues.apache.org/jira/browse/SPARK-1305 > Project: Spark > Issue Type: New Feature > Components: Block Manager >Reporter: Patrick Wendell >Assignee: Haoyuan Li >Priority: Blocker > Fix For: 1.0.0 > > > This is already an ongoing pull request - in a nutshell we want to support > Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1305) Support persisting RDD's directly to Tachyon
[ https://issues.apache.org/jira/browse/SPARK-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Henry Saputra updated SPARK-1305: - Comment: was deleted (was: Sorry to comment on old JIRA but does anyone have PR for this ticket?) > Support persisting RDD's directly to Tachyon > > > Key: SPARK-1305 > URL: https://issues.apache.org/jira/browse/SPARK-1305 > Project: Spark > Issue Type: New Feature > Components: Block Manager >Reporter: Patrick Wendell >Assignee: Haoyuan Li >Priority: Blocker > Fix For: 1.0.0 > > > This is already an ongoing pull request - in a nutshell we want to support > Tachyon as a storage level in Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2348) In Windows having an environment variable named 'classpath' gives error
Chirag Todarka created SPARK-2348: - Summary: In Windows having an environment variable named 'classpath' gives error Key: SPARK-2348 URL: https://issues.apache.org/jira/browse/SPARK-2348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: Windows 7 Enterprise Reporter: Chirag Todarka Operating System:: Windows 7 Enterprise If an environment variable named 'classpath' exists, then starting 'spark-shell' gives the error below:: \spark\bin>spark-shell Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. 14/07/02 14:22:06 WARN SparkILoop$SparkILoopInterpreter: Warning: compiler accessed before init set up. Assuming no postInit code. Failed to initialize compiler: object scala.runtime in compiler mirror not found. ** Note that as of 2.8 scala does not assume use of the java classpath. ** For the old behavior pass -usejavacp to scala, or if using a Settings ** object programatically, settings.usejavacp.value = true. Exception in thread "main" java.lang.AssertionError: assertion failed: null at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:202) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:929) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050721#comment-14050721 ] Yin Huai commented on SPARK-2339: - Also, names of those registered tables are case sensitive, but names of Hive tables are case insensitive. This will cause confusion when a user is using HiveContext. I think it may be good to treat all identifiers as case insensitive when a user is using HiveContext and make HiveContext.sql an alias of HiveContext.hql (basically do not expose catalyst's SQLParser in HiveContext). > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Yin Huai > Fix For: 1.1.0 > > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... > {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can find that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
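The behavior suggested in the comment above — treating identifiers case-insensitively — amounts to normalizing names at the catalog boundary. A hypothetical sketch in plain Scala (illustrative names, not Spark's actual Catalog API): when case-insensitive mode is on, `PEOPLE` and `people` resolve to the same registered table.

```scala
import scala.collection.mutable

// Hypothetical catalog sketch: identifiers are normalized once on the way
// in, so lookups ignore case when caseSensitive is false.
class SimpleCatalog(caseSensitive: Boolean) {
  private val tables = mutable.Map.empty[String, String]

  private def norm(name: String): String =
    if (caseSensitive) name else name.toLowerCase

  def registerTable(name: String, plan: String): Unit =
    tables(norm(name)) = plan

  def lookupTable(name: String): Option[String] = tables.get(norm(name))
}
```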
[jira] [Created] (SPARK-2347) Graph object can not be set to StorageLevel.MEMORY_ONLY_SER
Baoxu Shi created SPARK-2347: Summary: Graph object can not be set to StorageLevel.MEMORY_ONLY_SER Key: SPARK-2347 URL: https://issues.apache.org/jira/browse/SPARK-2347 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.0.0 Environment: Spark standalone with 5 workers and 1 driver Reporter: Baoxu Shi I'm creating Graph object by using Graph(vertices, edges, null, StorageLevel.MEMORY_ONLY, StorageLevel.MEMORY_ONLY) But that will throw out not serializable exception on both workers and driver. 14/07/02 16:30:26 ERROR BlockManagerWorker: Exception handling buffer message java.io.NotSerializableException: org.apache.spark.graphx.impl.VertexPartition at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177) at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42) at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:106) at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:30) at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:988) at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:997) at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102) at 
org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:392) at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:358) at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90) at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28) at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34) at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:662) at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:504) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Even if the driver sometime does not throw this exception, it will throw java.io.FileNotFoundException: /tmp/spark-local-20140702151845-9620/2a/shuffle_2_25_3 (No such 
file or directory) I know that VertexPartition is not supposed to be serializable, so is there any workaround for this? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2346) Error parsing table names that starts from numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Albul updated SPARK-2346: --- Description: Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names when they start from numbers. Steps to reproduce: {code:title=Test.scala|borderStyle=solid} case class Data(value: String) object Test { def main(args: Array[String]) { val sc = new SparkContext("local", "sql") val sqlSc = new SQLContext(sc) import sqlSc._ sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table") sql("SELECT * FROM '123_table'").collect().foreach(println) } } {code} And here is an exception: {quote} Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' expected but "123_table" found SELECT * FROM '123_table' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) at io.ubix.spark.Test$.main(Test.scala:24) at io.ubix.spark.Test.main(Test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {quote} When i am changing from 123_table to table_123 problem disappears. was: Looks like org.apache.spark.sql.catalyst.SqlParser cannot parse table names when they start from numbers. 
Steps to reproduce: {code:title=Test.scala|borderStyle=solid} case class Data(value: String) object Test { def main(args: Array[String]) { val sc = new SparkContext("local", "sql") val sqlSc = new SQLContext(sc) import sqlSc._ sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table") sql("SELECT * FROM '123_table'").collect().foreach(println) } } {code} And here is an exception: {quote} Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' expected but "123_table" found SELECT * FROM '123_table' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) at io.ubix.spark.Test$.main(Test.scala:24) at io.ubix.spark.Test.main(Test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {quote} When i am changing from 123_table to table_123 problem disappears. > Error parsing table names that starts from numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul > Labels: Parser, SQL > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. 
> Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > a
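The failure above is consistent with a lexer whose identifier token must start with a letter or underscore, so `123_table` is consumed as a number followed by unexpected text rather than as a table name. A rough illustration of such a rule (the regex is an assumption about a typical SQL grammar, not taken from SqlParser; note also that the repro wraps the name in single quotes, which many SQL dialects treat as a string literal rather than an identifier):

```python
import re

# Typical SQL identifier rule: first character must be a letter or '_'.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*$")

assert IDENT.match("table_123")      # accepted as an identifier
assert not IDENT.match("123_table")  # lexed as a number + trailing text
```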
[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050670#comment-14050670 ] Hari Shreedharan commented on SPARK-2345: - Looks like we'd have to do this in a new DStream, since the ForEachDStream takes a (RDD[T], Time)=> Unit, but to call runJob we'd have to pass in (Iterator[T], Time)=>Unit. I am not sure how much value this adds, but it does seem like if we are not using one of the built-in save/collect methods, you'd have to specifically run this function in context.runJob(...) Do you think this makes sense, [~tdas], [~pwendell]? > ForEachDStream should have an option of running the foreachfunc on Spark > > > Key: SPARK-2345 > URL: https://issues.apache.org/jira/browse/SPARK-2345 > Project: Spark > Issue Type: Bug >Reporter: Hari Shreedharan > > Today the Job generated simply calls the foreachfunc, but does not run it on > spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2346) Error parsing table names that starts from numbers
Alexander Albul created SPARK-2346: -- Summary: Error parsing table names that starts from numbers Key: SPARK-2346 URL: https://issues.apache.org/jira/browse/SPARK-2346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Alexander Albul Looks like org.apache.spark.sql.catalyst.SqlParser cannot parse table names when they start with numbers. Steps to reproduce: {code:title=Test.scala|borderStyle=solid} case class Data(value: String) object Test { def main(args: Array[String]) { val sc = new SparkContext("local", "sql") val sqlSc = new SQLContext(sc) import sqlSc._ sc.parallelize(List(Data("one"), Data("two"))).registerAsTable("123_table") sql("SELECT * FROM '123_table'").collect().foreach(println) } } {code} And here is an exception: {quote} Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' expected but "123_table" found SELECT * FROM '123_table' ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) at io.ubix.spark.Test$.main(Test.scala:24) at io.ubix.spark.Test.main(Test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {quote} When I change 123_table to table_123, the problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
[ https://issues.apache.org/jira/browse/SPARK-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050659#comment-14050659 ] Hari Shreedharan commented on SPARK-2345: - Currently, a save job (like saveAsTextFile or saveAsHadoopFile) on the DStream will cause the rdd.save calls to be executed via sparkContext.runJob, which in turn will call the foreachfunc which is passed to the ForEachDStream. So the case where this DStream is saved off works fine. But if you simply register the DStream and have the foreachfunc do some processing and custom writes, that processing may end up running locally on the driver. > ForEachDStream should have an option of running the foreachfunc on Spark > > > Key: SPARK-2345 > URL: https://issues.apache.org/jira/browse/SPARK-2345 > Project: Spark > Issue Type: Bug >Reporter: Hari Shreedharan > > Today the Job generated simply calls the foreachfunc, but does not run it on > spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2345) ForEachDStream should have an option of running the foreachfunc on Spark
Hari Shreedharan created SPARK-2345: --- Summary: ForEachDStream should have an option of running the foreachfunc on Spark Key: SPARK-2345 URL: https://issues.apache.org/jira/browse/SPARK-2345 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Today the Job generated simply calls the foreachfunc, but does not run it on spark itself using the sparkContext.runJob method. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
Alex created SPARK-2344: --- Summary: Add Fuzzy C-Means algorithm to MLlib Key: SPARK-2344 URL: https://issues.apache.org/jira/browse/SPARK-2344 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Alex I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. FCM is very similar to K-Means, which is already implemented; they differ only in the degree of relationship each point has with each cluster (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1). As part of the implementation I would like to: - create a base class for K-Means and FCM - implement the relationship for each algorithm differently (in its own class) -- This message was sent by Atlassian JIRA (v6.2#6252)
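To make the [0..1] membership degree concrete, here is a small standalone sketch of the standard FCM membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)), for one point against a set of centers. This is illustrative only (plain Python, 1-D points, no MLlib API), not a proposal for the actual class hierarchy:

```python
def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one 1-D point to each center.

    Returns degrees in [0, 1] that sum to 1; m > 1 is the fuzzifier.
    """
    dists = [abs(point - c) for c in centers]
    # A point sitting exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists)
            for d_i in dists]

u = fcm_memberships(1.0, centers=[0.0, 4.0])
assert abs(sum(u) - 1.0) < 1e-9          # degrees sum to 1
assert all(0.0 <= x <= 1.0 for x in u)   # each degree lies in [0..1]
assert u[0] > u[1]                       # the closer center dominates
```

With hard (K-Means style) assignment the same point would get exactly 0/1 membership, which is the only behavioral difference the issue describes.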
[jira] [Commented] (SPARK-1054) Get Cassandra support in Spark Core/Spark Cassandra Module
[ https://issues.apache.org/jira/browse/SPARK-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050544#comment-14050544 ] Rohit Rai commented on SPARK-1054: -- With the https://github.com/datastax/cassandra-driver-spark from Datastax, we should work on getting a unified standard API in Spark, taking the good parts from both worlds. > Get Cassandra support in Spark Core/Spark Cassandra Module > -- > > Key: SPARK-1054 > URL: https://issues.apache.org/jira/browse/SPARK-1054 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Rohit Rai > Labels: calliope, cassandra > > Calliope is a library providing an interface to consume data from Cassandra > to Spark and store RDDs from Spark to Cassandra. > Built as a wrapper over Cassandra's Hadoop I/O, it provides a simplified and > very generic API to consume and produce data from and to Cassandra. It > allows you to consume data from Legacy as well as CQL3 Cassandra Storage. It > can also harness C* to speed up your process by fetching only the relevant > data from C*, using CQL3 and C*'s secondary indexes. Though it currently > uses only the Hadoop I/O formats for Cassandra, in the near future we see the same > API harnessing other means of consuming Cassandra data, like using the > StorageProxy or even reading from SSTables directly. > Beyond the basic data fetch functionality, the Calliope API harnesses Scala and > its implicit parameters and conversions for you to work at a higher > abstraction, dealing with tuples/objects instead of Cassandra's Row/Columns in > your MapRed jobs. > Over the past few months we have seen the combination of Spark+Cassandra gaining > a lot of traction. And we feel Calliope provides the path of least friction > for developers to start working with this combination. > We have been using this in production for over a year now and the Calliope > early access repository has 30+ users. 
I am putting this issue to start a > discussion around whether we would want Calliope to be a part of Spark and if > yes, what will be involved in doing so. > You can read more about Calliope here - > http://tuplejump.github.io/calliope -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1054) Get Cassandra support in Spark Core/Spark Cassandra Module
[ https://issues.apache.org/jira/browse/SPARK-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Rai updated SPARK-1054: - Summary: Get Cassandra support in Spark Core/Spark Cassandra Module (was: Contribute Calliope Core to Spark as spark-cassandra) > Get Cassandra support in Spark Core/Spark Cassandra Module > -- > > Key: SPARK-1054 > URL: https://issues.apache.org/jira/browse/SPARK-1054 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Rohit Rai > Labels: calliope, cassandra > > Calliope is a library providing an interface to consume data from Cassandra > to spark and store RDDs from Spark to Cassandra. > Building as wrapper over Cassandra's Hadoop I/O it provides a simplified and > very generic API to consume and produces data from and to Cassandra. It > allows you to consume data from Legacy as well as CQL3 Cassandra Storage. It > can also harness C* to speed up your process by fetching only the relevant > data from C* harnessing CQL3 and C*'s secondary indexes. Though it currently > uses only the Hadoop I/O formats for Cassandra in near future we see the same > API harnessing other means of consuming Cassandra data like using the > StorageProxy or even reading from SSTables directly. > Over the basic data fetch functionality, the Calliope API harnesses Scala and > it's implicit parameters and conversions for you to work on a higher > abstraction dealing with tuples/objects instead of Cassandra's Row/Columns in > your MapRed jobs. > Over past few months we have seen the combination of Spark+Cassandra gaining > a lot of traction. And we feel Calliope provides the path of least friction > for developers to start working with this combination. > We have been using this ins production for over a year now and the Calliope > early access repository has 30+ users. 
I am putting this issue to start a > discussion around whether we would want Calliope to be a part of Spark and if > yes, what will be involved in doing so. > You can read more about Calliope here - > http://tuplejump.github.io/calliope -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050381#comment-14050381 ] Chen He commented on SPARK-2277: This is interesting. I will take a look. > Make TaskScheduler track whether there's host on a rack > --- > > Key: SPARK-2277 > URL: https://issues.apache.org/jira/browse/SPARK-2277 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > > When TaskSetManager adds a pending task, it checks whether the tasks's > preferred location is available. Regarding RACK_LOCAL task, we consider the > preferred rack available if such a rack is defined for the preferred host. > This is incorrect as there may be no alive hosts on that rack at all. > Therefore, TaskScheduler should track the hosts on each rack, and provides an > API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
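The bookkeeping the issue asks for amounts to the scheduler maintaining a rack-to-alive-hosts map and exposing a membership query for TaskSetManager. A minimal Python sketch of that idea (names are hypothetical, not the actual TaskScheduler API):

```python
from collections import defaultdict

class RackTracker:
    """Tracks which alive hosts sit on which rack."""

    def __init__(self):
        self.hosts_by_rack = defaultdict(set)
        self.rack_of = {}

    def host_added(self, host, rack):
        self.rack_of[host] = rack
        self.hosts_by_rack[rack].add(host)

    def host_lost(self, host):
        rack = self.rack_of.pop(host, None)
        if rack is not None:
            self.hosts_by_rack[rack].discard(host)

    def has_alive_host_on_rack(self, rack):
        # What TaskSetManager would query instead of assuming a rack is
        # usable just because some preferred host maps to it.
        return bool(self.hosts_by_rack.get(rack))

t = RackTracker()
t.host_added("host1", "rack-A")
assert t.has_alive_host_on_rack("rack-A")
t.host_lost("host1")
# The rack is still *defined* for the preferred host, but no alive host
# remains on it - exactly the case the issue says is mishandled today.
assert not t.has_alive_host_on_rack("rack-A")
```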
[jira] [Commented] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050347#comment-14050347 ] Michael Armbrust commented on SPARK-2342: - This does look like a typo (though maybe one that doesn't matter due to erasure?). That said, if you make a PR I'll certainly merge it. Thanks! > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric add/Minus/Multipy. > Just as the comment suggest : {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in code, function f was casted to function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I thought it as a typo and the correct should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2287) Make ScalaReflection be able to handle Generic case classes.
[ https://issues.apache.org/jira/browse/SPARK-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2287. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Takuya Ueshin > Make ScalaReflection be able to handle Generic case classes. > > > Key: SPARK-2287 > URL: https://issues.apache.org/jira/browse/SPARK-2287 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.0.1, 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2328) Add execution of `SHOW TABLES` before `TestHive.reset()`.
[ https://issues.apache.org/jira/browse/SPARK-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2328. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Assignee: Takuya Ueshin > Add execution of `SHOW TABLES` before `TestHive.reset()`. > - > > Key: SPARK-2328 > URL: https://issues.apache.org/jira/browse/SPARK-2328 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 1.0.1, 1.1.0 > > > {{PruningSuite}} is executed first of Hive tests unfortunately, > {{TestHive.reset()}} breaks the test environment. > To prevent this, we must run a query before calling reset the first time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2186) Spark SQL DSL support for simple aggregations such as SUM and AVG
[ https://issues.apache.org/jira/browse/SPARK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2186. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 > Spark SQL DSL support for simple aggregations such as SUM and AVG > - > > Key: SPARK-2186 > URL: https://issues.apache.org/jira/browse/SPARK-2186 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.0.0 >Reporter: Zongheng Yang >Priority: Minor > Fix For: 1.0.1, 1.1.0 > > > Inspired by this thread > (http://apache-spark-user-list.1001560.n3.nabble.com/Patterns-for-making-multiple-aggregations-in-one-pass-td7874.html): > Spark SQL doesn't seem to have DSL support for simple aggregations such as > AVG and SUM. It'd be nice if the user could avoid writing a SQL query and > instead write something like: > {code} > data.select('country, 'age.avg, 'hits.sum).groupBy('country).collect() > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-1850. Resolution: Fixed > Bad exception if multiple jars exist when running PySpark > - > > Key: SPARK-1850 > URL: https://issues.apache.org/jira/browse/SPARK-1850 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Andrew Or > Fix For: 1.0.1 > > > {code} > Found multiple Spark assembly jars in > /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: > Traceback (most recent call last): > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", > line 43, in > sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", > pyFiles=add_files) > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", > line 94, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", > line 180, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File > "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", > line 49, in launch_gateway > gateway_port = int(proc.stdout.readline()) > ValueError: invalid literal for int() with base 10: > 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' > {code} > It's trying to read the Java gateway port as an int from the sub-process' > STDOUT. However, what it read was an error message, which is clearly not an > int. We should differentiate between these cases and just propagate the > original message if it's not an int. Right now, this exception is not very > helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050318#comment-14050318 ] Andrew Or commented on SPARK-1850: -- Ye, I will change it. > Bad exception if multiple jars exist when running PySpark > - > > Key: SPARK-1850 > URL: https://issues.apache.org/jira/browse/SPARK-1850 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Andrew Or > Fix For: 1.0.1 > > > {code} > Found multiple Spark assembly jars in > /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: > Traceback (most recent call last): > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", > line 43, in > sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", > pyFiles=add_files) > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", > line 94, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", > line 180, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File > "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", > line 49, in launch_gateway > gateway_port = int(proc.stdout.readline()) > ValueError: invalid literal for int() with base 10: > 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' > {code} > It's trying to read the Java gateway port as an int from the sub-process' > STDOUT. However, what it read was an error message, which is clearly not an > int. We should differentiate between these cases and just propagate the > original message if it's not an int. Right now, this exception is not very > helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
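The differentiation the issue asks for amounts to checking whether the subprocess's first line is actually an integer before treating it as the gateway port, and surfacing the raw line in the error otherwise. A sketch of that guard (function name hypothetical; not the actual java_gateway.py code):

```python
def parse_gateway_port(first_line):
    """Return the gateway port, or raise with the launcher's own message."""
    stripped = first_line.strip()
    try:
        return int(stripped)
    except ValueError:
        # Propagate whatever the launcher printed (e.g. the "multiple
        # assembly jars" error) instead of a bare int() failure.
        raise RuntimeError("Java gateway failed to start: " + stripped)

assert parse_gateway_port("25333\n") == 25333
try:
    parse_gateway_port("spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n")
except RuntimeError as e:
    assert "spark-assembly" in str(e)   # the real cause is now visible
```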
[jira] [Created] (SPARK-2343) QueueInputDStream with oneAtATime=false does not dequeue items
Manuel Laflamme created SPARK-2343: -- Summary: QueueInputDStream with oneAtATime=false does not dequeue items Key: SPARK-2343 URL: https://issues.apache.org/jira/browse/SPARK-2343 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 0.9.1, 0.9.0 Reporter: Manuel Laflamme Priority: Minor QueueInputDStream does not dequeue items when used with the oneAtATime flag disabled. The same items are reprocessed for every batch. -- This message was sent by Atlassian JIRA (v6.2#6252)
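The reported behavior matches a compute() that, with oneAtATime disabled, reads the queue's contents without removing them, so every batch sees the same items again. A simplified Python sketch of the intended semantics (not the Scala source):

```python
from collections import deque

def take_batch(queue, one_at_a_time):
    """Items a QueueInputDStream-like source should consume per batch."""
    if one_at_a_time:
        return [queue.popleft()] if queue else []
    # A buggy variant would `return list(queue)` - reading the items but
    # leaving them enqueued, so each batch reprocesses all of them.
    batch = list(queue)
    queue.clear()          # the fix: actually dequeue what was consumed
    return batch

q = deque(["rdd1", "rdd2", "rdd3"])
assert take_batch(q, one_at_a_time=False) == ["rdd1", "rdd2", "rdd3"]
assert take_batch(q, one_at_a_time=False) == []   # nothing left over
```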
[jira] [Commented] (SPARK-1989) Exit executors faster if they get into a cycle of heavy GC
[ https://issues.apache.org/jira/browse/SPARK-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050005#comment-14050005 ] Guoqiang Li commented on SPARK-1989: In this case we should also trigger garbage collection on the driver. Related work: https://github.com/witgo/spark/compare/taskEvent > Exit executors faster if they get into a cycle of heavy GC > -- > > Key: SPARK-1989 > URL: https://issues.apache.org/jira/browse/SPARK-1989 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia > Fix For: 1.1.0 > > > I've seen situations where an application is allocating too much memory > across its tasks + cache to proceed, but Java gets into a cycle where it > repeatedly runs full GCs, frees up a bit of the heap, and continues instead > of giving up. This then leads to timeouts and confusing error messages. It > would be better to crash with OOM sooner. The JVM has options to support > this: http://java.dzone.com/articles/tracking-excessive-garbage. > The right solution would probably be: > - Add some config options used by spark-submit to set XX:GCTimeLimit and > XX:GCHeapFreeLimit, with more conservative values than the defaults (e.g. 90% > time limit, 5% free limit) > - Make sure we pass these into the Java options for executors in each > deployment mode -- This message was sent by Atlassian JIRA (v6.2#6252)
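Until such config options exist, the flags from the linked article can be passed through Spark's generic Java-options settings. A sketch with the conservative values the issue suggests (option names as in Spark 1.x; the exact limits are illustrative, and the GC overhead flags apply to the HotSpot parallel collector):

```shell
# Crash with OutOfMemoryError once >90% of time is spent in GC while
# <5% of the heap is being reclaimed, instead of looping on full GCs.
spark-submit \
  --driver-java-options "-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseGCOverheadLimit -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=5" \
  my-app.jar
```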
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049942#comment-14049942 ] Sean Owen commented on SPARK-2341: -- I've been a bit uncomfortable with how the MLlib API conflates categorical values and numbers, since they aren't numbers in general. Treating them as numbers is a convenience in some cases, and common in papers, but feels like suboptimal software design -- should a user have to convert categoricals to some numeric representation? To me it invites confusion, and this is one symptom. So I am not sure "multiclass" should mean "parse target as double" to begin with? OK, it's not the issue here. But we're on the subject of an experimental API subject to change with an example of something related that could be improved along the way, and it's my #1 wish for MLlib at the moment. I'd really like to work on a change to try to accommodate classes as, say, strings at least, and not presume doubles. But I am trying to figure out if anyone agrees with that. > loadLibSVMFile doesn't handle regression datasets > - > > Key: SPARK-2341 > URL: https://issues.apache.org/jira/browse/SPARK-2341 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Eustache >Priority: Minor > Labels: easyfix > > Many datasets exist in LibSVM format for regression tasks [1] but currently > the loadLibSVMFile primitive doesn't handle regression datasets. > More precisely, the LabelParser is either a MulticlassLabelParser or a > BinaryLabelParser. What happens then is that the file is loaded but in > multiclass mode : each target value is interpreted as a class name ! > The fix would be to write a RegressionLabelParser which converts target > values to Double and plug it into the loadLibSVMFile routine. > [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
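The proposed RegressionLabelParser boils down to taking the target column as a plain double instead of mapping it to a class id. A standalone sketch of parsing one LIBSVM-format line that way (illustrative only, not the MLlib loader):

```python
def parse_libsvm_line(line):
    """Parse 'target idx:val idx:val ...' keeping a real-valued target."""
    parts = line.split()
    target = float(parts[0])          # regression: no class-label mapping
    features = {}
    for item in parts[1:]:
        idx, val = item.split(":")
        features[int(idx)] = float(val)
    return target, features

target, feats = parse_libsvm_line("2.5 1:0.7 3:-1.2")
assert target == 2.5                 # a fractional target survives intact
assert feats == {1: 0.7, 3: -1.2}
```

A multiclass-style parser would instead treat each distinct target value as a class name, which is exactly the mis-loading the issue describes for regression datasets.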
[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets
[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049939#comment-14049939 ] Alexander Ulanov commented on SPARK-1473: - Does anybody work on this issue? > Feature selection for high dimensional datasets > --- > > Key: SPARK-1473 > URL: https://issues.apache.org/jira/browse/SPARK-1473 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ignacio Zendejas >Priority: Minor > Labels: features > Fix For: 1.1.0 > > > For classification tasks involving large feature spaces in the order of tens > of thousands or higher (e.g., text classification with n-grams, where n > 1), > it is often useful to rank and filter features that are irrelevant thereby > reducing the feature space by at least one or two orders of magnitude without > impacting performance on key evaluation metrics (accuracy/precision/recall). > A feature evaluation interface which is flexible needs to be designed and at > least two methods should be implemented with Information Gain being a > priority as it has been shown to be amongst the most reliable. > Special consideration should be taken in the design to account for wrapper > methods (see research papers below) which are more practical for lower > dimensional data. > Relevant research: > * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional > likelihood maximisation: a unifying framework for information theoretic > feature selection.*The Journal of Machine Learning Research*, *13*, 27-66. > * Forman, George. "An extensive empirical study of feature selection metrics > for text classification." The Journal of machine learning research 3 (2003): > 1289-1305. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1284) pyspark hangs after IOError on Executor
[ https://issues.apache.org/jira/browse/SPARK-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049937#comment-14049937 ] Matthew Farrellee commented on SPARK-1284: -- [~jblomo] - will you add a reproducer script to this issue? i did a simple test based on what you suggested w/ the tip of master and could not reproduce - {code} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type "help", "copyright", "credits" or "license" for more information. ... Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 1.0.0-SNAPSHOT /_/ Using Python version 2.7.5 (default, Feb 19 2014 13:47:28) SparkContext available as sc. >>> data = sc.textFile('/etc/passwd') 14/07/02 07:03:59 INFO MemoryStore: ensureFreeSpace(32816) called with curMem=0, maxMem=308910489 14/07/02 07:03:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 294.6 MB) >>> data.cache() /etc/passwd MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2 >>> data.take(10) ...[expected output]... >>> data.flatMap(lambda line: line.split(':')).map(lambda word: (word, >>> 1)).reduceByKey(lambda x, y: x + y).collect() ...[expected output, no hang]... {code} > pyspark hangs after IOError on Executor > --- > > Key: SPARK-1284 > URL: https://issues.apache.org/jira/browse/SPARK-1284 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Jim Blomo > > When running a reduceByKey over a cached RDD, Python fails with an exception, > but the failure is not detected by the task runner. Spark and the pyspark > shell hang waiting for the task to finish. 
> The error is: > {code} > PySpark worker failed with exception: > Traceback (most recent call last): > File "/home/hadoop/spark/python/pyspark/worker.py", line 77, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 182, in > dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 118, in > dump_stream > self._write_with_length(obj, stream) > File "/home/hadoop/spark/python/pyspark/serializers.py", line 130, in > _write_with_length > stream.write(serialized) > IOError: [Errno 104] Connection reset by peer > 14/03/19 22:48:15 INFO scheduler.TaskSetManager: Serialized task 4.0:0 as > 4257 bytes in 47 ms > Traceback (most recent call last): > File "/home/hadoop/spark/python/pyspark/daemon.py", line 117, in > launch_worker > worker(listen_sock) > File "/home/hadoop/spark/python/pyspark/daemon.py", line 107, in worker > outfile.flush() > IOError: [Errno 32] Broken pipe > {code} > I can reproduce the error by running take(10) on the cached RDD before > running reduceByKey (which looks at the whole input file). > Affects Version 1.0.0-SNAPSHOT (4d88030486) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1030) unneeded file required when running pyspark program using yarn-client
[ https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049929#comment-14049929 ] Matthew Farrellee commented on SPARK-1030: -- using pyspark to submit is deprecated in spark 1.0 in favor of spark-submit. i think this should be closed as resolved/workfix. /cc: [~pwendell] [~joshrosen] > unneeded file required when running pyspark program using yarn-client > - > > Key: SPARK-1030 > URL: https://issues.apache.org/jira/browse/SPARK-1030 > Project: Spark > Issue Type: Bug > Components: Deploy, PySpark, YARN >Affects Versions: 0.8.1 >Reporter: Diana Carroll >Assignee: Josh Rosen > > I can successfully run a pyspark program using the yarn-client master using > the following command: > {code} > SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar > \ > SPARK_YARN_APP_JAR=~/testdata.txt pyspark \ > test1.py > {code} > However, the SPARK_YARN_APP_JAR doesn't make any sense; it's a Python > program, and therefore there's no JAR. If I don't set the value, or if I set > the value to a non-existent files, Spark gives me an error message. > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46) > {code} > or > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. 
> : java.io.FileNotFoundException: File file:dummy.txt does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) > {code} > My program is very simple: > {code} > from pyspark import SparkContext > def main(): > sc = SparkContext("yarn-client", "Simple App") > logData = > sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log") > numjpgs = logData.filter(lambda s: '.jpg' in s).count() > print "Number of JPG requests: " + str(numjpgs) > {code} > Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at > all; I can point it at anything, as long as it's a valid, accessible file, > and it works the same. > Although there's an obvious workaround for this bug, it's high priority from > my perspective because I'm working on a course to teach people how to do > this, and it's really hard to explain why this variable is needed! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1257) Endless running task when using pyspark with input file containing a long line
[ https://issues.apache.org/jira/browse/SPARK-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049933#comment-14049933 ] Matthew Farrellee commented on SPARK-1257: -- recommend close as resolved w/ option for filer to reopen if the issue reproduces in 1.0 /cc: [~pwendell] [~joshrosen] > Endless running task when using pyspark with input file containing a long line > -- > > Key: SPARK-1257 > URL: https://issues.apache.org/jira/browse/SPARK-1257 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.0 >Reporter: Hanchen Su > > When launching any pyspark applications with an input file containing a very > long line(about 7 characters), the job will be hanging and never stops. > The application UI shows that there is a task running endlessly. > There will be no problem using the scala version with the same input. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1550) Successive creation of spark context fails in pyspark, if the previous initialization of spark context had failed.
[ https://issues.apache.org/jira/browse/SPARK-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049918#comment-14049918 ] Matthew Farrellee commented on SPARK-1550: -- this issue as reported is no longer present in spark 1.0, where defaults are provided for app name and master. {code} $ SPARK_HOME=dist PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.1-src.zip python Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from pyspark import SparkContext >>> sc=SparkContext('local') [successful creation of context] {code} i believe this should be closed as resolved. /cc: [~pwendell] > Successive creation of spark context fails in pyspark, if the previous > initialization of spark context had failed. > -- > > Key: SPARK-1550 > URL: https://issues.apache.org/jira/browse/SPARK-1550 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Prabin Banka > Labels: pyspark, sparkcontext > > For example;- > In PySpark, if we try to initialize spark context with insufficient > arguments, >>>sc=SparkContext('local') > it fails with an exception > Exception: An application name must be set in your configuration > This is all fine. > However, any successive creation of spark context with correct arguments, > also fails, > >>>s1=SparkContext('local', 'test1') > AttributeError: 'SparkContext' object has no attribute 'master' -- This message was sent by Atlassian JIRA (v6.2#6252)
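The failure mode in the original report can be reproduced without Spark at all: a singleton guard that is set before validation and never rolled back on error blocks every later, correctly-configured attempt. A toy sketch of that pattern and its fix (hypothetical class, not pyspark code):

```python
class FakeContext:
    """Toy stand-in for SparkContext illustrating the reported bug:
    a half-initialized singleton guard that is not rolled back on
    error blocks later, valid attempts. Not pyspark code."""
    _active = None

    def __init__(self, master=None, app_name=None):
        if FakeContext._active is not None:
            raise RuntimeError("a context is already active")
        FakeContext._active = self  # guard set before validation
        try:
            if app_name is None:
                raise ValueError("An application name must be set")
            self.master = master
        except Exception:
            FakeContext._active = None  # rollback lets retries succeed
            raise

try:
    FakeContext("local")             # fails: no app name
except ValueError:
    pass
ctx = FakeContext("local", "test1")  # succeeds because the guard was rolled back
print(ctx.master)  # local
```

Without the rollback in the `except` branch, the second construction would fail with the same kind of stale-state error the issue describes.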
[jira] [Commented] (SPARK-1850) Bad exception if multiple jars exist when running PySpark
[ https://issues.apache.org/jira/browse/SPARK-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049895#comment-14049895 ] Matthew Farrellee commented on SPARK-1850: -- [~andrewor14] - i think this should be closed as resolved in SPARK-2242 the current output for the error is, {noformat} $ ./dist/bin/pyspark Python 2.7.5 (default, Feb 19 2014, 13:47:28) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Traceback (most recent call last): File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/shell.py", line 43, in sc = SparkContext(appName="PySparkShell", pyFiles=add_files) File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/context.py", line 95, in __init__ SparkContext._ensure_initialized(self, gateway=gateway) File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/context.py", line 191, in _ensure_initialized SparkContext._gateway = gateway or launch_gateway() File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/java_gateway.py", line 66, in launch_gateway raise Exception(error_msg) Exception: Launching GatewayServer failed with exit code 1!(Warning: unexpected output detected.) Found multiple Spark assembly jars in /home/matt/Documents/Repositories/spark/dist/lib: spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4-.jar spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar Please remove all but one jar. 
{noformat} > Bad exception if multiple jars exist when running PySpark > - > > Key: SPARK-1850 > URL: https://issues.apache.org/jira/browse/SPARK-1850 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Andrew Or > Fix For: 1.0.1 > > > {code} > Found multiple Spark assembly jars in > /Users/andrew/Documents/dev/andrew-spark/assembly/target/scala-2.10: > Traceback (most recent call last): > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/shell.py", > line 43, in > sc = SparkContext(os.environ.get("MASTER", "local[*]"), "PySparkShell", > pyFiles=add_files) > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", > line 94, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/context.py", > line 180, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File > "/Users/andrew/Documents/dev/andrew-spark/python/pyspark/java_gateway.py", > line 49, in launch_gateway > gateway_port = int(proc.stdout.readline()) > ValueError: invalid literal for int() with base 10: > 'spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n' > {code} > It's trying to read the Java gateway port as an int from the sub-process' > STDOUT. However, what it read was an error message, which is clearly not an > int. We should differentiate between these cases and just propagate the > original message if it's not an int. Right now, this exception is not very > helpful. -- This message was sent by Atlassian JIRA (v6.2#6252)
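The fix the issue description asks for, distinguishing a port number from an error message on the launcher subprocess's stdout and propagating the latter, amounts to a guard of this shape (illustrative Python; the function name and message format are assumptions, not the actual pyspark java_gateway code):

```python
def parse_gateway_port(first_line):
    """Interpret the first stdout line of a launcher subprocess.

    If the line is not an integer port, raise with the subprocess's
    own output instead of a bare int() ValueError, so the user sees
    the real failure. Hypothetical sketch, not pyspark code.
    """
    try:
        return int(first_line.strip())
    except ValueError:
        raise Exception(
            "Launching GatewayServer failed: unexpected output %r" % first_line
        )

print(parse_gateway_port("25333\n"))  # 25333
try:
    parse_gateway_port("spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4-deps.jar\n")
except Exception as e:
    print("propagated:", e)
```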
[jira] [Commented] (SPARK-1884) Shark failed to start
[ https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049877#comment-14049877 ] Pete MacKinnon commented on SPARK-1884: --- This is due to the version of protobuf-java provided by Shark being older (2.4.1) than what's needed by Hadoop 2.4 (2.5.0). See SPARK-2338. > Shark failed to start > - > > Key: SPARK-1884 > URL: https://issues.apache.org/jira/browse/SPARK-1884 > Project: Spark > Issue Type: Bug >Affects Versions: 0.9.1 > Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 > (stand alone), scala 2.11.0 >Reporter: Wei Cui >Priority: Blocker > > the hadoop, hive, spark works fine. > when start the shark, it failed with the following messages: > Starting the Shark Command Line Client > 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive > is deprecated. Instead, use > mapreduce.input.fileinputformat.input.dir.recursive > 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is > deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize > 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is > deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize > 14/05/19 16:47:21 INFO Configuration.deprecation: > mapred.min.split.size.per.rack is deprecated. Instead, use > mapreduce.input.fileinputformat.split.minsize.per.rack > 14/05/19 16:47:21 INFO Configuration.deprecation: > mapred.min.split.size.per.node is deprecated. Instead, use > mapreduce.input.fileinputformat.split.minsize.per.node > 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is > deprecated. Instead, use mapreduce.job.reduces > 14/05/19 16:47:21 INFO Configuration.deprecation: > mapred.reduce.tasks.speculative.execution is deprecated. 
Instead, use > mapreduce.reduce.speculative > 14/05/19 16:47:22 WARN conf.Configuration: > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to > override final parameter: mapreduce.job.end-notification.max.retry.interval; > Ignoring. > 14/05/19 16:47:22 WARN conf.Configuration: > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to > override final parameter: mapreduce.cluster.local.dir; Ignoring. > 14/05/19 16:47:22 WARN conf.Configuration: > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to > override final parameter: mapreduce.job.end-notification.max.attempts; > Ignoring. > 14/05/19 16:47:22 WARN conf.Configuration: > org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to > override final parameter: mapreduce.cluster.temp.dir; Ignoring. > Logging initialized using configuration in > jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties > Hive history > file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt > 6.004: [GC 279616K->18440K(1013632K), 0.0438980 secs] > 6.445: [Full GC 59125K->7949K(1013632K), 0.0685160 secs] > Reloading cached RDDs from previous Shark sessions... 
(use -skipRddReload > flag to skip reloading) > 7.535: [Full GC 104136K->13059K(1013632K), 0.0885820 secs] > 8.459: [Full GC 61237K->18031K(1013632K), 0.0820400 secs] > 8.662: [Full GC 29832K->8958K(1013632K), 0.0869700 secs] > 8.751: [Full GC 13433K->8998K(1013632K), 0.0856520 secs] > 10.435: [Full GC 72246K->14140K(1013632K), 0.1797530 secs] > Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.metastore.HiveMetaStoreClient > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072) > at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49) > at shark.SharkCliDriver.(SharkCliDriver.scala:283) > at shark.SharkCliDriver$.main(SharkCliDriver.scala:162) > at shark.SharkCliDriver.main(SharkCliDriver.scala) > Caused by: java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.metastore.HiveMetaStoreClient > at > org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:51) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61) > at > org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288) > at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299) > at > org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070) > ... 4 more > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorA
[jira] [Commented] (SPARK-2306) BoundedPriorityQueue is private and not registered with Kryo
[ https://issues.apache.org/jira/browse/SPARK-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049818#comment-14049818 ] Daniel Darabos commented on SPARK-2306: --- You're the best, Ankit! Thanks! > BoundedPriorityQueue is private and not registered with Kryo > > > Key: SPARK-2306 > URL: https://issues.apache.org/jira/browse/SPARK-2306 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Daniel Darabos > > Because BoundedPriorityQueue is private and not registered with Kryo, RDD.top > cannot be used when using Kryo (the recommended configuration). > Curiously BoundedPriorityQueue is registered by GraphKryoRegistrator. But > that's the wrong registrator. (Is there one for Spark Core?) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1681) Handle hive support correctly in ./make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-1681: --- Summary: Handle hive support correctly in ./make-distribution.sh (was: Handle hive support correctly in ./make-distribution) > Handle hive support correctly in ./make-distribution.sh > --- > > Key: SPARK-1681 > URL: https://issues.apache.org/jira/browse/SPARK-1681 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > Fix For: 1.0.0 > > > When Hive support is enabled we should copy the datanucleus jars to the > packaged distribution. The simplest way would be to create a lib_managed > folder in the final distribution so that the compute-classpath script > searches in exactly the same way whether or not it's a release. > A slightly nicer solution is to put the jars inside of `/lib` and have some > fancier check for the jar location in the compute-classpath script. > We should also document how to run Spark SQL on YARN when hive support is > enabled. In particular how to add the necessary jars to spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049778#comment-14049778 ] Eustache commented on SPARK-2341: - OK, then would you mind if I worked on a doc improvement for this? Perhaps a simple no-brainer like "for regression, set this to true" could do the job. Personally I think `multiclassOrRegression` is a good option, but I'll let you decide :) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049765#comment-14049765 ] Xiangrui Meng edited comment on SPARK-2341 at 7/2/14 9:09 AM: -- It is a little awkward to have both `regression` and `multiclass` as input arguments. I agree that a correct name should be `multiclassOrRegression` or `multiclassOrContinuous`. But it is certainly too long. We tried to make this clear in the doc: {code} multiclass: whether the input labels contain more than two classes. If false, any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. So it works for both +1/-1 and 1/0 cases. If true, the double value parsed directly from the label string will be used as the label value. {code} It would be good if we can improve the documentation to make it clearer. But for the API, I don't feel that it is necessary to change. was (Author: mengxr): It is a little awkward to have both `regression` and `multiclass` as input arguments. I agree that a correct name should be `multiclassOrRegression`. But it is certainly too long. We tried to make this clear in the doc: {code} multiclass: whether the input labels contain more than two classes. If false, any label with value greater than 0.5 will be mapped to 1.0, or 0.0 otherwise. So it works for both +1/-1 and 1/0 cases. If true, the double value parsed directly from the label string will be used as the label value. {code} It would be good if we can improve the documentation to make it clearer. But for the API, I don't feel that it is necessary to change. 
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049755#comment-14049755 ] Eustache commented on SPARK-2341: - I see that LabelParser with multiclass=true works for the regression setting. What I fail to understand is how it relates to multiclass. Is the naming proper? In any case, shouldn't we provide a naming that explicitly mentions regression? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049732#comment-14049732 ] Xiangrui Meng commented on SPARK-2341: -- Just set `multiclass = true` to load double values. > loadLibSVMFile doesn't handle regression datasets > - > > Key: SPARK-2341 > URL: https://issues.apache.org/jira/browse/SPARK-2341 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Eustache >Priority: Minor > Labels: easyfix > > Many datasets exist in LibSVM format for regression tasks [1] but currently > the loadLibSVMFile primitive doesn't handle regression datasets. > More precisely, the LabelParser is either a MulticlassLabelParser or a > BinaryLabelParser. What happens then is that the file is loaded but in > multiclass mode : each target value is interpreted as a class name ! > The fix would be to write a RegressionLabelParser which converts target > values to Double and plug it into the loadLibSVMFile routine. > [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
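The documented semantics of the `multiclass` flag quoted in the comments above can be paraphrased in a few lines of plain Python (illustrative only, not MLlib code; `parse_label` is a hypothetical name):

```python
def parse_label(label_str, multiclass):
    """Label parsing as documented for loadLibSVMFile's `multiclass` flag:
    binary mode thresholds at 0.5 (so both +1/-1 and 1/0 labels work),
    while multiclass mode keeps the parsed double, which is also what a
    regression target needs. Plain-Python paraphrase, not MLlib code."""
    value = float(label_str)
    if multiclass:
        return value
    return 1.0 if value > 0.5 else 0.0

print(parse_label("-1", multiclass=False))   # 0.0  (binary: -1 maps to 0.0)
print(parse_label("1", multiclass=False))    # 1.0
print(parse_label("3.14", multiclass=True))  # 3.14 (regression target kept as-is)
```

The last line is why `multiclass = true` also loads regression datasets, which is exactly the naming confusion raised in the comments.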
[jira] [Updated] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yijie Shen updated SPARK-2342: -- Description: In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any {code} is intended to do computations for Numeric add/Minus/Multipy. Just as the comment suggest : {quote}Those expressions are supposed to be in the same data type, and also the return type.{quote} But in code, function f was casted to function signature: {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} I thought it as a typo and the correct should be: {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} was: In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any is intended to do computations for Numeric add/Minus/Multipy. Just as the comment suggest : "Those expressions are supposed to be in the same data type, and also the return type." But in code, function f was casted to function signature: (Numeric[n.JvmType], n.JvmType, n.JvmType) => Int I thought it as a typo and the correct should be: (Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric add/Minus/Multipy. 
> Just as the comment suggest : {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in code, function f was casted to function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I thought it as a typo and the correct should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
Yijie Shen created SPARK-2342: - Summary: Evaluation helper's output type doesn't conform to input type Key: SPARK-2342 URL: https://issues.apache.org/jira/browse/SPARK-2342 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Yijie Shen Priority: Minor In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala, protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: ((Numeric[Any], Any, Any) => Any)): Any is intended to do computations for Numeric Add/Minus/Multiply. Just as the comment suggests: "Those expressions are supposed to be in the same data type, and also the return type." But in the code, function f is cast to the signature: (Numeric[n.JvmType], n.JvmType, n.JvmType) => Int I think this is a typo; the correct signature should be: (Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType -- This message was sent by Atlassian JIRA (v6.2#6252)
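The type-conformance point is easy to see outside Scala as well. Below is a hypothetical Python analogue of such a helper, with the result type matching the operand type, which is the property the issue argues the Scala signature should declare (=> n.JvmType rather than => Int):

```python
from typing import Callable, TypeVar

N = TypeVar("N", int, float)

def n2(a: N, b: N, f: Callable[[N, N], N]) -> N:
    """Evaluation helper whose declared result type conforms to its
    operand type. Hypothetical Python analogue of the Scala helper
    discussed in the issue, not Spark code."""
    return f(a, b)

# With the conforming signature, the result keeps the operands' type:
print(n2(1.5, 2.25, lambda x, y: x + y))  # 3.75
# Declaring the result as Int would amount to truncating: int(3.75) == 3
```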
[jira] [Updated] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2339: Fix Version/s: 1.1.0 > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Yin Huai > Fix For: 1.1.0 > > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... > {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can find that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
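A toy model of the reported behavior, with no Catalyst code involved: storing the alias lowercased while resolving case-sensitively makes 'PEOPLE.name' fail to match. All names below are illustrative:

```python
def resolve(attribute, alias, case_sensitive=True, lowercase_alias=True):
    """Toy model of the bug: the Subquery alias is stored lowercased
    even when the parser is case-sensitive, so a qualified attribute
    like 'PEOPLE.name' no longer matches its table alias. Names are
    illustrative, not Catalyst code."""
    stored = alias.lower() if lowercase_alias else alias
    qualifier = attribute.split(".")[0]
    if not case_sensitive:
        qualifier, stored = qualifier.lower(), stored.lower()
    return qualifier == stored

print(resolve("PEOPLE.name", "PEOPLE", lowercase_alias=True))   # False: unresolved
print(resolve("PEOPLE.name", "PEOPLE", lowercase_alias=False))  # True
```

Either keeping the alias's original case (when the parser is case-sensitive) or lowercasing both sides consistently would resolve the attribute.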
[jira] [Created] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
Eustache created SPARK-2341: --- Summary: loadLibSVMFile doesn't handle regression datasets Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Priority: Minor Many datasets exist in LibSVM format for regression tasks [1], but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded, but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)