[jira] [Commented] (SPARK-2306) BoundedPriorityQueue is private and not registered with Kryo
[ https://issues.apache.org/jira/browse/SPARK-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052198#comment-14052198 ] ankit bhardwaj commented on SPARK-2306: --- Created a pull request for it: https://github.com/apache/spark/pull/1298 > BoundedPriorityQueue is private and not registered with Kryo > > > Key: SPARK-2306 > URL: https://issues.apache.org/jira/browse/SPARK-2306 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Daniel Darabos > > Because BoundedPriorityQueue is private and not registered with Kryo, RDD.top > cannot be used when using Kryo (the recommended configuration). > Curiously, BoundedPriorityQueue is registered by GraphKryoRegistrator. But > that's the wrong registrator. (Is there one for Spark Core?) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2059) Unresolved Attributes should cause a failure before execution time
[ https://issues.apache.org/jira/browse/SPARK-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2059. Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 > Unresolved Attributes should cause a failure before execution time > -- > > Key: SPARK-2059 > URL: https://issues.apache.org/jira/browse/SPARK-2059 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.0.1, 1.1.0 > > > Here's a partial solution: > https://github.com/marmbrus/spark/tree/analysisChecks -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2282: --- Affects Version/s: 0.9.1 > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a Linux server): > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
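The SO_REUSEADDR workaround mentioned in the description can be illustrated with a short Python sketch (the actual Spark fix lives in Scala; the helper name here is hypothetical):

```python
import socket

def make_reusable_socket():
    # SO_REUSEADDR lets bind() reuse a local address that is still in
    # TIME_WAIT, which mitigates ephemeral-port buildup when many
    # short-lived connections are opened in quick succession.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return s

sock = make_reusable_socket()
reuse = sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
sock.close()
```

Setting the option per-socket avoids relying on the system-wide tcp_tw_reuse/tcp_tw_recycle sysctls, which require root and affect every process on the host.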
[jira] [Resolved] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2282. Resolution: Fixed Fix Version/s: 1.0.0 1.0.1 0.9.2 > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.1, 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2350: --- Fix Version/s: 0.9.2 > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.1, 1.1.0 > > > ... if we launch a driver and there are more waiting drivers to be launched. > This is because we remove from a list while iterating through this. > Here is the culprit from Master.scala (L487 as of the creation of this JIRA, > commit bc7041a42dfa84312492ea8cae6fdeaeac4f6d1c). > {code} > for (driver <- waitingDrivers) { > if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= > driver.desc.cores) { > launchDriver(worker, driver) > waitingDrivers -= driver > } > } > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
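The culprit pattern above — removing elements from waitingDrivers while iterating over it — is language-agnostic. A minimal Python sketch of the usual fix, iterating over a snapshot while mutating the original (names are illustrative, not Spark's):

```python
def launch_drivers(waiting, free_mem):
    # Buggy pattern (as in the Scala snippet above): mutating a list
    # while iterating it directly can skip elements or crash.
    # Safe pattern: iterate over a snapshot, mutate the original.
    launched = []
    for driver_mem in list(waiting):   # list(...) takes a snapshot
        if free_mem >= driver_mem:
            launched.append(driver_mem)
            waiting.remove(driver_mem)
            free_mem -= driver_mem
    return launched, waiting

launched, remaining = launch_drivers([2, 8, 1], free_mem=4)
```

With 4 units of free memory, the 2- and 1-unit drivers launch and the 8-unit driver remains queued; because the loop walks a snapshot, removing the launched drivers is safe.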
[jira] [Commented] (SPARK-2307) SparkUI Storage page cached statuses incorrect
[ https://issues.apache.org/jira/browse/SPARK-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052183#comment-14052183 ] Patrick Wendell commented on SPARK-2307: There was a follow up patch: https://github.com/apache/spark/pull/1255 > SparkUI Storage page cached statuses incorrect > -- > > Key: SPARK-2307 > URL: https://issues.apache.org/jira/browse/SPARK-2307 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.0.1, 1.1.0 > > Attachments: Screen Shot 2014-06-27 at 11.09.54 AM.png > > > See attached: the executor has 512MB, but somehow it has cached (279 + 27 + > 279 + 27) = 612MB? (The correct answer is 279MB). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2350: --- Assignee: Aaron Davidson > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Aaron Davidson > Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2350) Master throws NPE
[ https://issues.apache.org/jira/browse/SPARK-2350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2350. Resolution: Fixed Fix Version/s: 1.0.1 Issue resolved by pull request 1289 [https://github.com/apache/spark/pull/1289] > Master throws NPE > - > > Key: SPARK-2350 > URL: https://issues.apache.org/jira/browse/SPARK-2350 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.0.1, 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2349: - Fix Version/s: 1.1.0 > Fix NPE in ExternalAppendOnlyMap > > > Key: SPARK-2349 > URL: https://issues.apache.org/jira/browse/SPARK-2349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.0.1, 1.1.0 > > > It throws an NPE on null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2349: - Fix Version/s: 1.0.1 > Fix NPE in ExternalAppendOnlyMap > > > Key: SPARK-2349 > URL: https://issues.apache.org/jira/browse/SPARK-2349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > Fix For: 1.0.1, 1.1.0 > > > It throws an NPE on null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052127#comment-14052127 ] Ankur Dave commented on SPARK-2365: --- Proposed implementation: https://github.com/apache/spark/pull/1297 > Add IndexedRDD, an efficient updatable key-value store > -- > > Key: SPARK-2365 > URL: https://issues.apache.org/jira/browse/SPARK-2365 > Project: Spark > Issue Type: New Feature > Components: GraphX, Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave > > RDDs currently provide a bulk-updatable, iterator-based interface. This > imposes minimal requirements on the storage layer, which only needs to > support sequential access, enabling on-disk and serialized storage. > However, many applications would benefit from a richer interface. Efficient > support for point lookups would enable serving data out of RDDs, but it > currently requires iterating over an entire partition to find the desired > element. Point updates similarly require copying an entire iterator. Joins > are also expensive, requiring a shuffle and local hash joins. > To address these problems, we propose IndexedRDD, an efficient key-value > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key > uniqueness and pre-indexing the entries for efficient joins and point > lookups, updates, and deletions. > It would be implemented by (1) hash-partitioning the entries by key, (2) > maintaining a hash index within each partition, and (3) using purely > functional (immutable and efficiently updatable) data structures to enable > efficient modifications and deletions. > GraphX would be the first user of IndexedRDD, since it currently implements a > limited form of this functionality in VertexRDD. We envision a variety of > other uses for IndexedRDD, including streaming updates to RDDs, direct > serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
Ankur Dave created SPARK-2365: - Summary: Add IndexedRDD, an efficient updatable key-value store Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.2#6252)
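The three implementation steps — (1) hash-partition by key, (2) per-partition hash index, (3) efficiently updatable functional structures — can be sketched with plain dictionaries. This is a toy illustration of the layout, not the proposed Spark API:

```python
class IndexedStore:
    """Toy sketch of IndexedRDD's layout: entries are hash-partitioned
    by key, and each partition keeps a hash index for O(1) lookups."""
    def __init__(self, entries, num_partitions=4):
        self.num_partitions = num_partitions
        self.partitions = [{} for _ in range(num_partitions)]
        for k, v in entries:                       # step 1: hash-partition
            self.partitions[hash(k) % num_partitions][k] = v

    def get(self, key):
        # Step 2: a point lookup touches exactly one partition's index,
        # instead of scanning every entry as a plain RDD would.
        return self.partitions[hash(key) % self.num_partitions].get(key)

    def put(self, key, value):
        # Step 3: "functional" update — copy only the affected partition,
        # sharing the untouched ones with the original store.
        i = hash(key) % self.num_partitions
        updated = IndexedStore([], self.num_partitions)
        updated.partitions = list(self.partitions)
        updated.partitions[i] = dict(self.partitions[i])
        updated.partitions[i][key] = value
        return updated

store = IndexedStore([(1, "a"), (2, "b")])
store2 = store.put(3, "c")       # original store is unmodified
```

A real implementation would use a persistent hash trie rather than copying a whole partition per update, but the sharing structure is the same idea.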
[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052103#comment-14052103 ] Rui Li commented on SPARK-2277: --- With [PR #892|https://github.com/apache/spark/pull/892], we'll check if a task's preference is available when adding it to pending lists. TaskScheduler tracks information about executor/host, so that TaskSetManager can check if the preferred executor/host is available. TaskScheduler also provides getRackForHost to get the corresponding rack for a host (currently only returns None). I think this is some prior acquired knowledge about the cluster topology, which does not indicate whether there's any host on that rack granted to this spark app. Therefore we don't know the availability of the preferred rack. > Make TaskScheduler track whether there's host on a rack > --- > > Key: SPARK-2277 > URL: https://issues.apache.org/jira/browse/SPARK-2277 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > > When TaskSetManager adds a pending task, it checks whether the tasks's > preferred location is available. Regarding RACK_LOCAL task, we consider the > preferred rack available if such a rack is defined for the preferred host. > This is incorrect as there may be no alive hosts on that rack at all. > Therefore, TaskScheduler should track the hosts on each rack, and provides an > API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
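The tracking the issue asks for can be sketched as a rack-to-alive-hosts map kept up to date by executor add/lost events (class and method names below are assumptions, not Spark's):

```python
from collections import defaultdict

class RackTracker:
    """Track alive hosts per rack so a TaskSetManager can ask whether a
    preferred rack actually has a live host, instead of assuming it does
    just because the topology defines that rack."""
    def __init__(self, rack_of):
        self.rack_of = rack_of                 # static topology: host -> rack
        self.hosts_by_rack = defaultdict(set)  # alive hosts per rack

    def host_added(self, host):
        rack = self.rack_of.get(host)
        if rack is not None:
            self.hosts_by_rack[rack].add(host)

    def host_lost(self, host):
        rack = self.rack_of.get(host)
        if rack is not None:
            self.hosts_by_rack[rack].discard(host)

    def has_host_on_rack(self, rack):
        # The API the issue proposes for TaskSetManager to call.
        return bool(self.hosts_by_rack.get(rack))

tracker = RackTracker({"h1": "r1", "h2": "r1"})
tracker.host_added("h1")
```

The key distinction from getRackForHost is that this answers "is any host on rack r alive for this app?", not merely "which rack does host h belong to?".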
[jira] [Created] (SPARK-2364) ShuffledDStream run tasks only when dstream has partition items
guowei created SPARK-2364: - Summary: ShuffledDStream run tasks only when dstream has partition items Key: SPARK-2364 URL: https://issues.apache.org/jira/browse/SPARK-2364 Project: Spark Issue Type: Improvement Components: Streaming Reporter: guowei ShuffledDStream currently runs tasks regardless of whether the dstream's partitions contain any items. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2346) Register as table should not accept table names that start with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052083#comment-14052083 ] Alexander Albul commented on SPARK-2346: You're right, thanks. > Register as table should not accept table names that start with numbers > --- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start with numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When I change 123_table to table_123, the problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
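A registerAsTable-time guard matching SQL-92 identifier rules (no leading digit) might look like the following sketch; the regex and helper are assumptions for illustration, not Spark's actual validation:

```python
import re

# SQL-92-style identifier: must start with a letter or underscore, so
# tokens like "2e2" stay unambiguous numeric literals.
IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def validate_table_name(name):
    """Reject table names that the SQL parser could never resolve."""
    if not IDENT.match(name):
        raise ValueError(f"invalid table name: {name!r}")
    return name

validate_table_name("table_123")   # accepted
```

Failing fast at registration time gives a clear error instead of the confusing parse failure at query time shown in the stack trace above.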
[jira] [Commented] (SPARK-1378) Build error: org.eclipse.paho:mqtt-client
[ https://issues.apache.org/jira/browse/SPARK-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052080#comment-14052080 ] Mukul Jain commented on SPARK-1378: --- This still seems to be an issue: Downloading: https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom Jul 3, 2014 6:22:27 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out Jul 3, 2014 6:22:27 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request It is failing to download. I am running behind a corporate firewall; not sure if that has anything to do with it. My build got stuck exactly like this earlier in the build process while trying to download the Scala compiler jar, but after a few attempts it was able to proceed and download the file. Seems like a repo issue. > Build error: org.eclipse.paho:mqtt-client > - > > Key: SPARK-1378 > URL: https://issues.apache.org/jira/browse/SPARK-1378 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 0.9.0 >Reporter: Ken Williams > > Using Maven, I'm unable to build the 0.9.0 distribution I just downloaded. I > attempt like so: > {code} > mvn -U -Pyarn -Dhadoop.version=2.2.0 -Dyarn.version=2.2.0 -DskipTests package > {code} > The Maven error is: > {code} > [ERROR] Failed to execute goal on project spark-examples_2.10: Could not > resolve dependencies for project > org.apache.spark:spark-examples_2.10:jar:0.9.0-incubating: Could not find > artifact org.eclipse.paho:mqtt-client:jar:0.4.0 in nexus > {code} > My Maven version is 3.2.1, running on Java 1.7.0, using Scala 2.10.4. > Is there an additional Maven repository I should add or something? > If I go into the {{pom.xml}} and comment out the {{external/mqtt}} and > {{examples}} modules, the build succeeds. I'm fine without the MQTT stuff, > but I would really like to get the examples working because I haven't played > with Spark before. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2363) Clean MLlib's sample data files
Xiangrui Meng created SPARK-2363: Summary: Clean MLlib's sample data files Key: SPARK-2363 URL: https://issues.apache.org/jira/browse/SPARK-2363 Project: Spark Issue Type: Task Components: MLlib Reporter: Xiangrui Meng Priority: Minor MLlib has sample data under several folders: 1) data/mllib 2) data/ 3) mllib/data/* Per previous discussion with [~matei], we want to put them under `data/mllib` and clean up outdated files. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2354) BitSet Range Expanded when creating new one
[ https://issues.apache.org/jira/browse/SPARK-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yijie Shen updated SPARK-2354: -- Affects Version/s: 1.0.0 > BitSet Range Expanded when creating new one > --- > > Key: SPARK-2354 > URL: https://issues.apache.org/jira/browse/SPARK-2354 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Yijie Shen >Priority: Minor > > BitSet has a constructor parameter named "numBits: Int", which indicates the number of bits it holds. > There is also a function called "capacity", which returns the number of long words used to hold those bits. > When creating a new BitSet, for example in '|', the newly created one shouldn't be sized by the longer word array's length; instead, it should use the larger set's number of bits: > {code}def |(other: BitSet): BitSet = { > val newBS = new BitSet(math.max(numBits, other.numBits)) > // I know by now that numBits isn't a field > {code} > Is there some other reason to expand the BitSet range that I'm missing? -- This message was sent by Atlassian JIRA (v6.2#6252)
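The reporter's point is that a union only needs max(numBits) bits, and the word capacity can be derived from that bit count without losing any set bit. A small Python sketch of bit-count-based sizing (not Spark's BitSet):

```python
class Bits:
    """Minimal bitset backed by 64-bit words, sized by bit count."""
    WORD = 64

    def __init__(self, num_bits):
        self.num_bits = num_bits
        # capacity in words = ceil(num_bits / 64)
        self.words = [0] * ((num_bits + self.WORD - 1) // self.WORD)

    def set(self, i):
        self.words[i // self.WORD] |= 1 << (i % self.WORD)

    def get(self, i):
        return (self.words[i // self.WORD] >> (i % self.WORD)) & 1 == 1

    def union(self, other):
        # Size the result by the larger *bit* count, as the issue
        # suggests, rather than by word-array length; no bit of either
        # operand can exceed max(num_bits), so nothing is lost.
        out = Bits(max(self.num_bits, other.num_bits))
        for i, w in enumerate(self.words):
            out.words[i] |= w
        for i, w in enumerate(other.words):
            out.words[i] |= w
        return out

a = Bits(70); a.set(65)
b = Bits(10); b.set(3)
c = a.union(b)     # 70 bits -> 2 words, no over-allocation
```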
[jira] [Commented] (SPARK-2362) newFilesOnly = true FileInputDStream processes existing files in a directory
[ https://issues.apache.org/jira/browse/SPARK-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052076#comment-14052076 ] Tathagata Das commented on SPARK-2362: -- https://github.com/apache/spark/pull/1077 > newFilesOnly = true FileInputDStream processes existing files in a directory > > > Key: SPARK-2362 > URL: https://issues.apache.org/jira/browse/SPARK-2362 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2325) Utils.getLocalDir had better check the directory and choose a good one instead of choosing the first one directly
[ https://issues.apache.org/jira/browse/SPARK-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052075#comment-14052075 ] YanTang Zhai commented on SPARK-2325: - I've created PR: https://github.com/apache/spark/pull/1281. Please help to review. Thanks. > Utils.getLocalDir had better check the directory and choose a good one > instead of choosing the first one directly > - > > Key: SPARK-2325 > URL: https://issues.apache.org/jira/browse/SPARK-2325 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: YanTang Zhai > > If the first directory of spark.local.dir is bad, application will exit with > the exception: > Exception in thread "main" java.io.IOException: Failed to create a temp > directory (under /data1/sparkenv/local) after 10 attempts! > at org.apache.spark.util.Utils$.createTempDir(Utils.scala:258) > at > org.apache.spark.broadcast.HttpBroadcast$.createServer(HttpBroadcast.scala:154) > at > org.apache.spark.broadcast.HttpBroadcast$.initialize(HttpBroadcast.scala:127) > at > org.apache.spark.broadcast.HttpBroadcastFactory.initialize(HttpBroadcastFactory.scala:31) > at > org.apache.spark.broadcast.BroadcastManager.initialize(BroadcastManager.scala:48) > at > org.apache.spark.broadcast.BroadcastManager.(BroadcastManager.scala:35) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218) > at org.apache.spark.SparkContext.(SparkContext.scala:202) > at JobTaskJoin$.main(JobTaskJoin.scala:9) > at JobTaskJoin.main(JobTaskJoin.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:601) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at 
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Utils.getLocalDir had better check the directory and choose a good one > instead of choosing the first one directly. For example, spark.local.dir is > /data1/sparkenv/local,/data2/sparkenv/local. The disk data1 is bad while the > disk data2 is good, we could choose the data2 not data1. -- This message was sent by Atlassian JIRA (v6.2#6252)
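The behavior the issue requests — probe each configured directory and take the first usable one — can be sketched like this (hypothetical helper, not Spark's actual Utils.getLocalDir):

```python
import os
import tempfile

def pick_local_dir(spark_local_dir):
    """Return the first directory in the comma-separated spark.local.dir
    list where a file can actually be created, instead of blindly taking
    the first entry even when its disk is bad."""
    for d in spark_local_dir.split(","):
        d = d.strip()
        try:
            os.makedirs(d, exist_ok=True)
            # Probe writability by creating and removing a temp file.
            fd, path = tempfile.mkstemp(dir=d)
            os.close(fd)
            os.remove(path)
            return d
        except OSError:
            continue   # bad disk or unwritable path: try the next one
    raise IOError("no usable local directory in " + spark_local_dir)
```

With spark.local.dir = /data1/sparkenv/local,/data2/sparkenv/local and a failed data1 disk, this would return the data2 directory rather than aborting after ten failed attempts on data1.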
[jira] [Created] (SPARK-2362) newFilesOnly = true FileInputDStream processes existing files in a directory
Tathagata Das created SPARK-2362: Summary: newFilesOnly = true FileInputDStream processes existing files in a directory Key: SPARK-2362 URL: https://issues.apache.org/jira/browse/SPARK-2362 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL
[ https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052068#comment-14052068 ] Kan Zhang commented on SPARK-2010: -- Sounds reasonable. Named tuple is a better fit than dictionary for struct type. Presumably it is due to lack of pickling support for named tuple that we resorted to dictionary for python schema definition. But for nested dictionaries, we should treat them as map type. > Support for nested data in PySpark SQL > -- > > Key: SPARK-2010 > URL: https://issues.apache.org/jira/browse/SPARK-2010 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Kan Zhang > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bert Greevenbosch updated SPARK-2352: - Summary: [MLLIB] Add Artificial Neural Network (ANN) to Spark (was: Add Artificial Neural Network (ANN) to Spark) > [MLLIB] Add Artificial Neural Network (ANN) to Spark > > > Key: SPARK-2352 > URL: https://issues.apache.org/jira/browse/SPARK-2352 > Project: Spark > Issue Type: New Feature > Components: MLlib > Environment: MLLIB code >Reporter: Bert Greevenbosch > > It would be good if the Machine Learning Library contained Artificial Neural > Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2361) Decide whether to broadcast or serialize the weights directly in MLlib algorithms
Xiangrui Meng created SPARK-2361: Summary: Decide whether to broadcast or serialize the weights directly in MLlib algorithms Key: SPARK-2361 URL: https://issues.apache.org/jira/browse/SPARK-2361 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng In the current implementation, MLlib serializes weights directly into the closure. This is okay for a small feature dimension, but not efficient for feature dimensions beyond 1M, especially since the default akka.frameSize is only 10 MB. We should use broadcast when the size of the serialized task is going to be large. -- This message was sent by Atlassian JIRA (v6.2#6252)
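A hedged sketch of the proposed decision rule: keep small weight vectors in the task closure, broadcast large ones before they approach the frame size. The threshold and names below are assumptions for illustration, not MLlib code:

```python
import pickle

FRAME_SIZE = 10 * 1024 * 1024   # default akka.frameSize: 10 MB

def should_broadcast(weights, threshold=FRAME_SIZE // 10):
    """Decide the transport for model weights: a small payload can ride
    inside each serialized task closure, while a large one should be
    broadcast once per iteration and referenced by the tasks."""
    return len(pickle.dumps(weights)) > threshold

small = [0.0] * 10
large = [0.0] * 1_000_000   # roughly 9 MB once pickled
```

Measuring the serialized size (rather than element count) is what matters here, since the frame-size limit applies to the bytes actually shipped with each task.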
[jira] [Created] (SPARK-2360) CSV import to SchemaRDDs
Michael Armbrust created SPARK-2360: --- Summary: CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2010) Support for nested data in PySpark SQL
[ https://issues.apache.org/jira/browse/SPARK-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052035#comment-14052035 ] Michael Armbrust commented on SPARK-2010: - I think probably the right thing to do here is use named tuples instead of dictionaries as the python struct equivalent. Dictionaries can then be used for maps. One issue here is that we will need to fix our pickling library used by pyspark as it cannot serialize named tuples. > Support for nested data in PySpark SQL > -- > > Key: SPARK-2010 > URL: https://issues.apache.org/jira/browse/SPARK-2010 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Kan Zhang > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build
[ https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052030#comment-14052030 ] Xiangrui Meng commented on SPARK-2358: -- PR: https://github.com/apache/spark/pull/1295 > Add an option to include native BLAS/LAPACK loader in the build > --- > > Key: SPARK-2358 > URL: https://issues.apache.org/jira/browse/SPARK-2358 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It would be easy for users to include the netlib-java jniloader in the spark > jar, which is LGPL-licensed. We can follow the same approach as ganglia > support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at > build time. We can use "SPARK_NETLIB_LGPL" flag for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2346) Register as table should not accept table names that start with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052020#comment-14052020 ] Michael Armbrust edited comment on SPARK-2346 at 7/4/14 12:03 AM: -- The goal of the sql method is to provide something that is close to SQL-92, which explicitly disallows identifiers that start with numbers (as this makes expressions like 2e2 kind of ambiguous). I think your query will work if you run it using the hive parser, using the hql method, instead. was (Author: marmbrus): The goal of the SQL method is to provide something that is close to SQL-92, which explicitly disallows identifiers that start with numbers (as this makes expressions like 2e2 kind of ambiguous). I think your query will work if you run it using the hive parser, using the hql method, instead. > Register as table should not accept table names that start with numbers > --- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2346) Register as table should not accept table names that start with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052020#comment-14052020 ] Michael Armbrust commented on SPARK-2346: - The goal of the SQL method is to provide something that is close to SQL-92, which explicitly disallows identifiers that start with numbers (as this makes expressions like 2e2 kind of ambiguous). I think your query will work if you run it using the hive parser, using the hql method, instead. > Register as table should not accept table names that start with numbers > --- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
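The identifier rule Michael refers to can be sketched as a simple validation check. This is an illustrative Python sketch, not Spark's actual implementation: a SQL-92 "regular identifier" must begin with a letter, so a name such as `123_table` could be rejected at `registerAsTable` time, before any query ever reaches the parser.

```python
import re

# SQL-92 regular identifiers: a letter followed by letters, digits, or
# underscores. Names like "123_table" fail this rule.
IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_valid_table_name(name):
    return bool(IDENTIFIER.match(name))

print(is_valid_table_name("table_123"))  # True
print(is_valid_table_name("123_table"))  # False
```

Under this check, renaming `123_table` to `table_123` (as the reporter observed) makes the problem disappear.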
[jira] [Commented] (SPARK-2346) Register as table should not accept table names that start with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052018#comment-14052018 ] Alexander Albul commented on SPARK-2346: Well, it depends actually. The link that you sent just shows a limitation of the PostgreSQL database, which means they do not have an optimal lexer. On the other hand, Hive supports any kind of table name. I found this bug because I migrated from Shark to Spark SQL, and some of my tests that had tables starting with numbers started to fail. > Register as table should not accept table names that start with numbers > --- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2346: Labels: starter (was: Parser SQL) > Error parsing table names that starts with numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2346: Fix Version/s: 1.1.0 > Error parsing table names that starts with numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2346) Register as table should not accept table names that start with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2346: Summary: Register as table should not accept table names that start with numbers (was: Error parsing table names that starts with numbers) > Register as table should not accept table names that start with numbers > --- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at 
com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2346) Error parsing table names that starts with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052004#comment-14052004 ] Michael Armbrust commented on SPARK-2346: - Here is more info: http://stackoverflow.com/questions/15917064/table-or-column-name-cannot-start-with-numeric > Error parsing table names that starts with numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) 
> {quote} > When i am changing from 123_table to table_123 problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2346) Error parsing table names that starts with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2346: Priority: Minor (was: Major) > Error parsing table names that starts with numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2346) Error parsing table names that starts with numbers
[ https://issues.apache.org/jira/browse/SPARK-2346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052002#comment-14052002 ] Michael Armbrust commented on SPARK-2346: - I think this is actually a bug in the registerAsTable function. It is not valid SQL to start a table name with a number AFAIK. > Error parsing table names that starts with numbers > -- > > Key: SPARK-2346 > URL: https://issues.apache.org/jira/browse/SPARK-2346 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Alexander Albul > Labels: Parser, SQL > > Looks like *org.apache.spark.sql.catalyst.SqlParser* cannot parse table names > when they start from numbers. > Steps to reproduce: > {code:title=Test.scala|borderStyle=solid} > case class Data(value: String) > object Test { > def main(args: Array[String]) { > val sc = new SparkContext("local", "sql") > val sqlSc = new SQLContext(sc) > import sqlSc._ > sc.parallelize(List(Data("one"), > Data("two"))).registerAsTable("123_table") > sql("SELECT * FROM '123_table'").collect().foreach(println) > } > } > {code} > And here is an exception: > {quote} > Exception in thread "main" java.lang.RuntimeException: [1.15] failure: ``('' > expected but "123_table" found > SELECT * FROM '123_table' > ^ > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:47) > at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:70) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:150) > at io.ubix.spark.Test$.main(Test.scala:24) > at io.ubix.spark.Test.main(Test.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at 
com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) > {quote} > When i am changing from 123_table to table_123 problem disappears. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051998#comment-14051998 ] Tathagata Das commented on SPARK-1853: -- If you look at what is shown in any spark program's stage description, it shows the lines of "user code" that created that stage. For streaming, it shows the lines of internal code (spark streaming code) instead of the user code that created it. So in this case of the screenshot, it should show 4520 - take at Tutorial.scala:34 4521 - map at Tutorial.scala:XXX ... 4513 - reduceByKey at Tutorial.scala:YYY > Show Streaming application code context (file, line number) in Spark Stages UI > -- > > Key: SPARK-1853 > URL: https://issues.apache.org/jira/browse/SPARK-1853 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Tathagata Das >Assignee: Mubarak Seyed > Fix For: 1.1.0 > > Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png > > > Right now, the code context (file, and line number) shown for streaming jobs > in stages UI is meaningless as it refers to internal DStream: > rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1853: - Assignee: Mubarak Seyed > Show Streaming application code context (file, line number) in Spark Stages UI > -- > > Key: SPARK-1853 > URL: https://issues.apache.org/jira/browse/SPARK-1853 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Tathagata Das >Assignee: Mubarak Seyed > Fix For: 1.1.0 > > Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png > > > Right now, the code context (file, and line number) shown for streaming jobs > in stages UI is meaningless as it refers to internal DStream: > rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2349. --- Resolution: Fixed https://github.com/apache/spark/pull/1288 > Fix NPE in ExternalAppendOnlyMap > > > Key: SPARK-2349 > URL: https://issues.apache.org/jira/browse/SPARK-2349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > > It throws an NPE on null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
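The null-key failure is easy to model in miniature. Below is a hedged Python sketch of a combine-by-key map (not the actual ExternalAppendOnlyMap code) showing the behavior a fix should provide: a null (None) key is combined like any other key rather than crashing.

```python
def combine_by_key(pairs, create, merge):
    # Build up per-key combined values; a None key is treated like any
    # other key instead of triggering a crash.
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = merge(combined[key], value)
        else:
            combined[key] = create(value)
    return combined

# None keys can legitimately appear in a pair RDD, so they must combine.
result = combine_by_key([(None, 1), ("a", 2), (None, 3)],
                        create=lambda v: v,
                        merge=lambda a, b: a + b)
```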
[jira] [Updated] (SPARK-2359) Supporting common statistical functions in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin updated SPARK-2359: - Summary: Supporting common statistical functions in MLlib (was: Supporting common statistical estimators in MLlib) > Supporting common statistical functions in MLlib > > > Key: SPARK-2359 > URL: https://issues.apache.org/jira/browse/SPARK-2359 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Reynold Xin >Assignee: Doris Xin > > This is originally proposed by [~falaki]. > This is a proposal for a new package within the Spark distribution to support > common statistical estimators. We think consolidating statistical related > functions in a separate package will help with readability of core source > code and encourage spark users to submit back their functions. > Please see the initial design document here: > https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2359) Supporting common statistical estimators in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2359: - Summary: Supporting common statistical estimators in MLlib (was: Spark Stats package: supporting common statistical estimators for Big Data) > Supporting common statistical estimators in MLlib > - > > Key: SPARK-2359 > URL: https://issues.apache.org/jira/browse/SPARK-2359 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Reynold Xin >Assignee: Doris Xin > > This is originally proposed by [~falaki]. > This is a proposal for a new package within the Spark distribution to support > common statistical estimators. We think consolidating statistical related > functions in a separate package will help with readability of core source > code and encourage spark users to submit back their functions. > Please see the initial design document here: > https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2359) Spark Stats package: supporting common statistical estimators for Big Data
Reynold Xin created SPARK-2359: -- Summary: Spark Stats package: supporting common statistical estimators for Big Data Key: SPARK-2359 URL: https://issues.apache.org/jira/browse/SPARK-2359 Project: Spark Issue Type: New Feature Reporter: Reynold Xin This is originally proposed by [~falaki]. This is a proposal for a new package within the Spark distribution to support common statistical estimators. We think consolidating statistical related functions in a separate package will help with readability of core source code and encourage spark users to submit back their functions. Please see the initial design document here: https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2359) Spark Stats package: supporting common statistical estimators for Big Data
[ https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2359: - Component/s: MLlib > Spark Stats package: supporting common statistical estimators for Big Data > -- > > Key: SPARK-2359 > URL: https://issues.apache.org/jira/browse/SPARK-2359 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Reynold Xin >Assignee: Doris Xin > > This is originally proposed by [~falaki]. > This is a proposal for a new package within the Spark distribution to support > common statistical estimators. We think consolidating statistical related > functions in a separate package will help with readability of core source > code and encourage spark users to submit back their functions. > Please see the initial design document here: > https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2359) Spark Stats package: supporting common statistical estimators for Big Data
[ https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2359: --- Assignee: Doris Xin > Spark Stats package: supporting common statistical estimators for Big Data > -- > > Key: SPARK-2359 > URL: https://issues.apache.org/jira/browse/SPARK-2359 > Project: Spark > Issue Type: New Feature >Reporter: Reynold Xin >Assignee: Doris Xin > > This is originally proposed by [~falaki]. > This is a proposal for a new package within the Spark distribution to support > common statistical estimators. We think consolidating statistical related > functions in a separate package will help with readability of core source > code and encourage spark users to submit back their functions. > Please see the initial design document here: > https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2017) web ui stage page becomes unresponsive when the number of tasks is large
[ https://issues.apache.org/jira/browse/SPARK-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051977#comment-14051977 ] Reynold Xin commented on SPARK-2017: It is definitely browser specific (but for all browsers!!!). That's why I think just having the aggregated metrics by default and the list of tasks that failed is probably a good idea. Thoughts? > web ui stage page becomes unresponsive when the number of tasks is large > > > Key: SPARK-2017 > URL: https://issues.apache.org/jira/browse/SPARK-2017 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin > Labels: starter > > {code} > sc.parallelize(1 to 100, 100).count() > {code} > The above code creates one million tasks to be executed. The stage detail web > ui page takes forever to load (if it ever completes). > There are again a few different alternatives: > 0. Limit the number of tasks we show. > 1. Pagination > 2. By default only show the aggregate metrics and failed tasks, and hide the > successful ones. -- This message was sent by Atlassian JIRA (v6.2#6252)
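Option 2 above (aggregate metrics by default, plus only the failed tasks) can be outlined as follows. This is an illustrative Python sketch of the aggregation, not the actual web UI code:

```python
import statistics

def summarize_tasks(tasks):
    # tasks: list of (duration_ms, succeeded) pairs. Render aggregate
    # metrics plus only the failed tasks, instead of one row per task.
    durations = [d for d, _ in tasks]
    return {
        "count": len(tasks),
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "failed": [t for t in tasks if not t[1]],
    }

summary = summarize_tasks([(10, True), (20, True), (30, False)])
```

The page then renders a constant-size summary regardless of task count, which avoids the browser choking on a million table rows.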
[jira] [Updated] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.
[ https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1516: - Fix Version/s: 0.9.2 > Yarn Client should not call System.exit, should throw exception instead. > > > Key: SPARK-1516 > URL: https://issues.apache.org/jira/browse/SPARK-1516 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: DB Tsai > Fix For: 0.9.2, 1.0.1 > > > People submit spark job inside their application to yarn cluster using spark > yarn client, and it's not desirable to call System.exit in yarn client which > will terminate the parent application as well. > We should throw exception instead, and people can determine which action they > want to take given the exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1516) Yarn Client should not call System.exit, should throw exception instead.
[ https://issues.apache.org/jira/browse/SPARK-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051975#comment-14051975 ] Xiangrui Meng commented on SPARK-1516: -- PR for branch-0.9: https://github.com/apache/spark/pull/1099 > Yarn Client should not call System.exit, should throw exception instead. > > > Key: SPARK-1516 > URL: https://issues.apache.org/jira/browse/SPARK-1516 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: DB Tsai > Fix For: 0.9.2, 1.0.1 > > > People submit spark job inside their application to yarn cluster using spark > yarn client, and it's not desirable to call System.exit in yarn client which > will terminate the parent application as well. > We should throw exception instead, and people can determine which action they > want to take given the exception. -- This message was sent by Atlassian JIRA (v6.2#6252)
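The proposed change can be illustrated with a small sketch (Python, with a hypothetical `submit_application` and `YarnClientError`; not the actual YARN client code): raising an exception lets the embedding application decide what to do, whereas System.exit would terminate the parent application as well.

```python
class YarnClientError(Exception):
    """Raised on submission problems instead of exiting the process."""

def submit_application(config):
    # Hypothetical submission routine: a bad config raises, so an
    # embedding application can catch and recover; calling sys.exit()
    # here would tear the parent application down too.
    if not config.get("queue"):
        raise YarnClientError("no YARN queue configured")
    return "submitted"

try:
    submit_application({})
except YarnClientError as exc:
    status = "submission failed: %s" % exc
```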
[jira] [Commented] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051967#comment-14051967 ] Mubarak Seyed commented on SPARK-1853: -- Hi TD, Stages UI shows _code context_ for both application file and internal DStream. Are you referring to remove the _internal DStream_ code context in description? !Screen Shot 2014-07-03 at 2.54.05 PM.png|width=300,height=500! Thanks, Mubarak > Show Streaming application code context (file, line number) in Spark Stages UI > -- > > Key: SPARK-1853 > URL: https://issues.apache.org/jira/browse/SPARK-1853 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Tathagata Das > Fix For: 1.1.0 > > Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png > > > Right now, the code context (file, and line number) shown for streaming jobs > in stages UI is meaningless as it refers to internal DStream: > rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2109) Setting SPARK_MEM for bin/pyspark does not work.
[ https://issues.apache.org/jira/browse/SPARK-2109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2109. Resolution: Fixed Fixed in master and 1.0 via https://github.com/apache/spark/pull/1050/files > Setting SPARK_MEM for bin/pyspark does not work. > - > > Key: SPARK-2109 > URL: https://issues.apache.org/jira/browse/SPARK-2109 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma >Priority: Critical > Fix For: 1.0.1, 1.1.0 > > > prashant@sc:~/work/spark$ SPARK_MEM=10G bin/pyspark > Python 2.7.6 (default, Mar 22 2014, 22:59:56) > [GCC 4.8.2] on linux2 > Type "help", "copyright", "credits" or "license" for more information. > Traceback (most recent call last): > File "/home/prashant/work/spark/python/pyspark/shell.py", line 43, in > > sc = SparkContext(appName="PySparkShell", pyFiles=add_files) > File "/home/prashant/work/spark/python/pyspark/context.py", line 94, in > __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File "/home/prashant/work/spark/python/pyspark/context.py", line 190, in > _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File "/home/prashant/work/spark/python/pyspark/java_gateway.py", line 51, > in launch_gateway > gateway_port = int(proc.stdout.readline()) > ValueError: invalid literal for int() with base 10: 'Warning: SPARK_MEM is > deprecated, please use a more specific config option\n' -- This message was sent by Atlassian JIRA (v6.2#6252)
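The traceback shows `launch_gateway` parsing the first stdout line as an integer, which breaks when the SPARK_MEM deprecation warning is printed ahead of the port number. A minimal sketch of a more tolerant handshake (illustrative only; `read_gateway_port` is a hypothetical helper, not the actual fix):

```python
def read_gateway_port(lines):
    # Skip banner/warning lines (e.g. the SPARK_MEM deprecation notice)
    # and return the first line that parses as an integer port.
    for line in lines:
        try:
            return int(line.strip())
        except ValueError:
            continue
    raise RuntimeError("gateway process never printed a port")

port = read_gateway_port([
    "Warning: SPARK_MEM is deprecated, please use a more specific config option",
    "25333",
])
```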
[jira] [Updated] (SPARK-1853) Show Streaming application code context (file, line number) in Spark Stages UI
[ https://issues.apache.org/jira/browse/SPARK-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mubarak Seyed updated SPARK-1853: - Attachment: Screen Shot 2014-07-03 at 2.54.05 PM.png > Show Streaming application code context (file, line number) in Spark Stages UI > -- > > Key: SPARK-1853 > URL: https://issues.apache.org/jira/browse/SPARK-1853 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.0.0 >Reporter: Tathagata Das > Fix For: 1.1.0 > > Attachments: Screen Shot 2014-07-03 at 2.54.05 PM.png > > > Right now, the code context (file, and line number) shown for streaming jobs > in stages UI is meaningless as it refers to internal DStream: > rather than user application file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051946#comment-14051946 ] Xiangrui Meng commented on SPARK-2308: -- Is there a reference paper/work about using uniform sampling in k-means? Usually in practice the clusters are not balanced. With uniform sampling, you may miss many points from a small cluster. > Add KMeans MiniBatch clustering algorithm to MLlib > -- > > Key: SPARK-2308 > URL: https://issues.apache.org/jira/browse/SPARK-2308 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Priority: Minor > > Mini-batch is a version of KMeans that uses a randomly-sampled subset of the > data points in each iteration instead of the full set of data points, > improving performance (and in some cases, accuracy). The mini-batch version > is compatible with the KMeans|| initialization algorithm currently > implemented in MLlib. > I suggest adding KMeans Mini-batch as an alternative. > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
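The mini-batch update the ticket proposes can be sketched independently of MLlib. Below is a minimal pure-Python illustration of the idea (a Sculley-style per-center learning rate, points sampled uniformly per batch, which is exactly the uniform sampling Xiangrui questions above). All names are hypothetical; this is not MLlib code and not the proposed implementation.

```python
import random

def mini_batch_kmeans(points, k, batch_size, iters, seed=0):
    # points: list of equal-length numeric tuples; centers start as a random sample.
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # per-center update counts, used as a decaying learning rate
    for _ in range(iters):
        # Uniformly sampled mini-batch instead of the full data set.
        batch = [points[rng.randrange(len(points))] for _ in range(batch_size)]
        for p in batch:
            # Assign the point to its nearest center (squared Euclidean distance).
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the center sees more points
            centers[j] = [(1 - eta) * a + eta * b for a, b in zip(centers[j], p)]
    return centers
```

The shrinking step size makes each center converge to the running mean of the points assigned to it, which is what makes the mini-batch variant compatible with a k-means|| style initialization.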
[jira] [Assigned] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build
[ https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-2358: Assignee: Xiangrui Meng > Add an option to include native BLAS/LAPACK loader in the build > --- > > Key: SPARK-2358 > URL: https://issues.apache.org/jira/browse/SPARK-2358 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > It would be easy for users to include the netlib-java jniloader in the spark > jar, which is LGPL-licensed. We can follow the same approach as ganglia > support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at > build time. We can use "SPARK_NETLIB_LGPL" flag for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2353) ArrayIndexOutOfBoundsException in scheduler
[ https://issues.apache.org/jira/browse/SPARK-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated SPARK-2353: --- Description: I suspect the recent changes from SPARK-1937 to compute valid locality levels (and ignore ones which are not applicable) have resulted in this issue. Specifically, some of the code using currentLocalityIndex (and lastLaunchTime actually) seems to be assuming a) a constant population of locality levels, and b) probably also immutability/repeatability of locality levels. These do not hold any longer. I do not have the exact values for which this failure was observed (since this is from the logs of a failed job) - but the code path is suspect. Also note that the line numbers/classes might not exactly match master since we are in the middle of a merge. But the issue should hopefully be evident. java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:439) at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:388) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:248) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:244) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:241) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:241) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:133) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:86) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Unfortunately, we do not have the bandwidth to tackle this issue - would be great if someone could take a look at it ! Thanks. was: I suspect the recent changes from SPARK-1937 to compute valid locality levels (and ignoring ones which are not applicable) has resulted in this issue. Specifically, some of the code using currentLocalityIndex (and lastLaunchTime actually) seems to be assuming a) constant population of locality levels. b) probably also immutablility/repeatibility of locality levels These do not hold any longer. I do not have the exact values for which this failure was observed (since this is from the logs of a failed job) - but the code path is highly suspect. Also note that the line numbers/classes might not exactly match master since we are in the middle of a merge. But the issue should hopefully be evident. 
java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:439) at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:388) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:248) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:244) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:241) at scala.collection.Indexe
[jira] [Created] (SPARK-2358) Add an option to include native BLAS/LAPACK libraries in the build
Xiangrui Meng created SPARK-2358: Summary: Add an option to include native BLAS/LAPACK libraries in the build Key: SPARK-2358 URL: https://issues.apache.org/jira/browse/SPARK-2358 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng It would be easy for users to include the netlib-java jniloader in the spark jar, which is LGPL-licensed. We can follow the same approach as ganglia support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at build time. We can use "SPARK_NETLIB_LGPL" flag for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2358) Add an option to include native BLAS/LAPACK loader in the build
[ https://issues.apache.org/jira/browse/SPARK-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2358: - Summary: Add an option to include native BLAS/LAPACK loader in the build (was: Add an option to include native BLAS/LAPACK libraries in the build) > Add an option to include native BLAS/LAPACK loader in the build > --- > > Key: SPARK-2358 > URL: https://issues.apache.org/jira/browse/SPARK-2358 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xiangrui Meng > > It would be easy for users to include the netlib-java jniloader in the spark > jar, which is LGPL-licensed. We can follow the same approach as ganglia > support in Spark, which is enabled by turning on "SPARK_GANGLIA_LGPL" at > build time. We can use "SPARK_NETLIB_LGPL" flag for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2357) HashFilteredJoin doesn't match some equi-join query
[ https://issues.apache.org/jira/browse/SPARK-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zongheng Yang resolved SPARK-2357. -- Resolution: Not a Problem > HashFilteredJoin doesn't match some equi-join query > --- > > Key: SPARK-2357 > URL: https://issues.apache.org/jira/browse/SPARK-2357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0, 1.0.1 >Reporter: Zongheng Yang >Priority: Minor > > For instance, this query: > hql("""SELECT * FROM src a JOIN src b ON a.key = 238""") > is a case where the HashFilteredJoin pattern doesn't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2342) Evaluation helper's output type doesn't conform to input type
[ https://issues.apache.org/jira/browse/SPARK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2342. - Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 > Evaluation helper's output type doesn't conform to input type > - > > Key: SPARK-2342 > URL: https://issues.apache.org/jira/browse/SPARK-2342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.0 >Reporter: Yijie Shen >Priority: Minor > Labels: easyfix > Fix For: 1.0.1, 1.1.0 > > > In sql/catalyst/org/apache/spark/sql/catalyst/expressions.scala > {code}protected final def n2 ( i: Row, e1: Expression, e2: Expression, f: > ((Numeric[Any], Any, Any) => Any)): Any {code} > is intended to do computations for Numeric Add/Minus/Multiply. > Just as the comment suggests: {quote}Those expressions are supposed to be in > the same data type, and also the return type.{quote} > But in the code, function f was cast to the function signature: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => Int{code} > I take it as a typo; the correct signature should be: > {code}(Numeric[n.JvmType], n.JvmType, n.JvmType) => n.JvmType{code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1997) Update breeze to version 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-1997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051882#comment-14051882 ] Xiangrui Meng commented on SPARK-1997: -- [~gq] Could you help test the following? 1) dependency changes in breeze 0.8.1 and their licenses, including libraries added and removed 2) number of files in the breeze 0.8.1 jar > Update breeze to version 0.8.1 > -- > > Key: SPARK-1997 > URL: https://issues.apache.org/jira/browse/SPARK-1997 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > {{breeze 0.7}} does not support {{scala 2.11}} . -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2357) HashFilteredJoin doesn't match some equi-join query
Zongheng Yang created SPARK-2357: Summary: HashFilteredJoin doesn't match some equi-join query Key: SPARK-2357 URL: https://issues.apache.org/jira/browse/SPARK-2357 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0, 1.0.1 Reporter: Zongheng Yang Priority: Minor For instance, this query: hql("""SELECT * FROM src a JOIN src b ON a.key = 238""") is a case where the HashFilteredJoin pattern doesn't match. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1675) Make clear whether computePrincipalComponents requires centered data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1675. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1171 [https://github.com/apache/spark/pull/1171] > Make clear whether computePrincipalComponents requires centered data > > > Key: SPARK-1675 > URL: https://issues.apache.org/jira/browse/SPARK-1675 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Trivial > Fix For: 1.1.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2355) Check for the number of clusters to avoid ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-2355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2355. -- Resolution: Duplicate > Check for the number of clusters to avoid ArrayIndexOutOfBoundsException > > > Key: SPARK-2355 > URL: https://issues.apache.org/jira/browse/SPARK-2355 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Liang-Chi Hsieh > > When the number of clusters given to > org.apache.spark.mllib.clustering.KMeans under the parallel initialization mode is > greater than the number of data points, it throws an ArrayIndexOutOfBoundsException. > The KMeans class should check that the number of clusters is not greater > than the number of data points. > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:62) > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > at > org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:297) > at > org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:294) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.Range.foreach(Range.scala:141) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:294) > at > org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) > at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) > at > org.apache.spark.examples.mllib.DenseKMeans$.run(DenseKMeans.scala:102) > at > 
org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:72) > at > org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:71) > at scala.Option.map(Option.scala:145) > at > org.apache.spark.examples.mllib.DenseKMeans$.main(DenseKMeans.scala:71) > at org.apache.spark.examples.mllib.DenseKMeans.main(DenseKMeans.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2330) Spark shell has weird scala semantics
[ https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2330. - Resolution: Duplicate Going to close this as a duplicate. We should have a fix for the original issue soon. > Spark shell has weird scala semantics > - > > Key: SPARK-2330 > URL: https://issues.apache.org/jira/browse/SPARK-2330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.1, 1.0.0 > Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2 >Reporter: Andrea Ferretti > Labels: scala, shell > > Normal scala expressions are interpreted in a strange way in the spark shell. > For instance > {noformat} > case class Foo(x: Int) > def print(f: Foo) = f.x > val f = Foo(3) > print(f) > :24: error: type mismatch; > found : Foo > required: Foo > {noformat} > For another example > {noformat} > trait Currency > case object EUR extends Currency > case object USD extends Currency > def nextCurrency: Currency = nextInt(2) match { > case 0 => EUR > case _ => USD > } > :22: error: type mismatch; > found : EUR.type > required: Currency > case 0 => EUR > :24: error: type mismatch; > found : USD.type > required: Currency > case _ => USD > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop
[ https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostiantyn Kudriavtsev updated SPARK-2356: -- Summary: Exception: Could not locate executable null\bin\winutils.exe in the Hadoop (was: Exaption: Could not locate executable null\bin\winutils.exe in the Hadoop ) > Exception: Could not locate executable null\bin\winutils.exe in the Hadoop > --- > > Key: SPARK-2356 > URL: https://issues.apache.org/jira/browse/SPARK-2356 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Kostiantyn Kudriavtsev > > I'm trying to run some transformations on Spark; they work fine on a cluster > (YARN, Linux machines). However, when I try to run them on a local machine > (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files > from the local filesystem): > 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the > hadoop binary path > java.io.IOException: Could not locate executable null\bin\winutils.exe in the > Hadoop binaries. 
> at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) > at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) > at org.apache.hadoop.util.Shell.(Shell.java:326) > at org.apache.hadoop.util.StringUtils.(StringUtils.java:76) > at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) > at org.apache.hadoop.security.Groups.(Groups.java:77) > at > org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) > at > org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) > at > org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) > at > org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36) > at > org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109) > at > org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) > at org.apache.spark.SparkContext.(SparkContext.scala:228) > at org.apache.spark.SparkContext.(SparkContext.scala:97) > It happens because the Hadoop config is initialised each time a Spark > context is created, regardless of whether Hadoop is required. > I propose adding a flag to indicate whether the Hadoop config is required > (or a way to start this configuration manually) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2356) Exaption: Could not locate executable null\bin\winutils.exe in the Hadoop
Kostiantyn Kudriavtsev created SPARK-2356: - Summary: Exaption: Could not locate executable null\bin\winutils.exe in the Hadoop Key: SPARK-2356 URL: https://issues.apache.org/jira/browse/SPARK-2356 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kostiantyn Kudriavtsev I'm trying to run some transformations on Spark; they work fine on a cluster (YARN, Linux machines). However, when I try to run them on a local machine (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files from the local filesystem): 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318) at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333) at org.apache.hadoop.util.Shell.(Shell.java:326) at org.apache.hadoop.util.StringUtils.(StringUtils.java:76) at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93) at org.apache.hadoop.security.Groups.(Groups.java:77) at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255) at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283) at org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36) at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109) at org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala) at org.apache.spark.SparkContext.(SparkContext.scala:228) at org.apache.spark.SparkContext.(SparkContext.scala:97) It happens because the Hadoop config is initialised each time a Spark context is created, regardless of whether Hadoop is required. 
I propose adding a flag to indicate whether the Hadoop config is required (or a way to start this configuration manually) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2324) SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
[ https://issues.apache.org/jira/browse/SPARK-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2324. --- Resolution: Fixed Resolved by https://github.com/apache/spark/pull/1274 > SparkContext should not exit directly when spark.local.dir is a list of > multiple paths and one of them has error > > > Key: SPARK-2324 > URL: https://issues.apache.org/jira/browse/SPARK-2324 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: YanTang Zhai > > The spark.local.dir is configured as a list of multiple paths, for example > /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver > node has an error, the application will exit since DiskBlockManager exits > directly at createLocalDirs. If the disk data2 of the worker node has an error, > the executor will exit as well. > DiskBlockManager should not exit directly at createLocalDirs if one of the > spark.local.dir paths has an error. Since spark.local.dir has multiple paths, a > problem with one should not affect the overall situation. > I think DiskBlockManager could ignore the bad directory at createLocalDirs. -- This message was sent by Atlassian JIRA (v6.2#6252)
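The skip-the-bad-directory behavior the report asks for can be sketched outside Spark. The helper below is a hypothetical illustration (not DiskBlockManager code): it probes each configured path and keeps only the writable ones, failing only when none survive.

```python
import os

def usable_local_dirs(paths):
    # Keep directories we can actually create and write to; skip the rest
    # with a warning instead of exiting, mirroring the requested behavior.
    good = []
    for p in paths:
        try:
            os.makedirs(p, exist_ok=True)
            probe = os.path.join(p, ".probe")
            with open(probe, "w"):
                pass
            os.remove(probe)
            good.append(p)
        except OSError:
            print(f"ignoring bad local dir: {p}")
    if not good:
        # Only a total loss of local storage is fatal.
        raise RuntimeError("no usable local directories")
    return good
```

A single failing disk then degrades capacity rather than killing the SparkContext, which matches the resolution in pull request 1274.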
[jira] [Created] (SPARK-2355) Check for the number of clusters to avoid ArrayIndexOutOfBoundsException
Liang-Chi Hsieh created SPARK-2355: -- Summary: Check for the number of clusters to avoid ArrayIndexOutOfBoundsException Key: SPARK-2355 URL: https://issues.apache.org/jira/browse/SPARK-2355 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Liang-Chi Hsieh When the number of clusters given to org.apache.spark.mllib.clustering.KMeans under the parallel initialization mode is greater than the number of data points, it throws an ArrayIndexOutOfBoundsException. The KMeans class should check that the number of clusters is not greater than the number of data points. Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.mllib.clustering.LocalKMeans$$anonfun$kMeansPlusPlus$1.apply$mcVI$sp(LocalKMeans.scala:62) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:49) at org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:297) at org.apache.spark.mllib.clustering.KMeans$$anonfun$20.apply(KMeans.scala:294) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.Range.foreach(Range.scala:141) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:294) at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) at org.apache.spark.examples.mllib.DenseKMeans$.run(DenseKMeans.scala:102) at org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:72) at org.apache.spark.examples.mllib.DenseKMeans$$anonfun$main$1.apply(DenseKMeans.scala:71) at scala.Option.map(Option.scala:145) at 
org.apache.spark.examples.mllib.DenseKMeans$.main(DenseKMeans.scala:71) at org.apache.spark.examples.mllib.DenseKMeans.main(DenseKMeans.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.2#6252)
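The check the report asks for is a simple fail-fast guard: validate k against the data size before initialization ever indexes past the data. A minimal sketch (hypothetical helper, not the actual KMeans API):

```python
def check_k(points, k):
    # Raise a clear error up front instead of letting k-means|| initialization
    # hit an ArrayIndexOutOfBoundsException deep inside the clustering code.
    n = len(points)
    if k > n:
        raise ValueError(f"requested {k} clusters but only {n} data points")
    return k
```

Raising a descriptive error at the API boundary is cheaper to debug than an index error surfacing from LocalKMeans several frames down.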
[jira] [Created] (SPARK-2354) BitSet Range Expanded when creating new one
Yijie Shen created SPARK-2354: - Summary: BitSet Range Expanded when creating new one Key: SPARK-2354 URL: https://issues.apache.org/jira/browse/SPARK-2354 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Yijie Shen Priority: Minor BitSet has a constructor parameter named "numBits: Int" which indicates the number of bits it holds. There is also a function called "capacity" which gives the number of long words used to hold the bits. When creating a new BitSet, for example in '|', I thought the newly created one shouldn't be sized by the longer word array's length; instead, it should use the larger set's number of bits: {code}def |(other: BitSet): BitSet = { val newBS = new BitSet(math.max(numBits, other.numBits)) // I know by now the numBits isn't a field {code} Is there some other reason for expanding the BitSet range that I'm not aware of? -- This message was sent by Atlassian JIRA (v6.2#6252)
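The distinction the report draws, the logical bit count (numBits) versus the backing word count (capacity), can be illustrated with a toy bit set. This is a hypothetical sketch in Python, not Spark's org.apache.spark.util.collection.BitSet:

```python
class ToyBitSet:
    WORD = 64  # bits per backing word

    def __init__(self, num_bits):
        self.num_bits = num_bits
        # capacity in 64-bit words: ceil(num_bits / 64)
        self.words = [0] * ((num_bits + self.WORD - 1) // self.WORD)

    def set(self, i):
        self.words[i // self.WORD] |= 1 << (i % self.WORD)

    def __or__(self, other):
        # Size the result by the larger *bit count*, as the report suggests,
        # rather than by the longer backing word array.
        out = ToyBitSet(max(self.num_bits, other.num_bits))
        for i, w in enumerate(self.words):
            out.words[i] |= w
        for i, w in enumerate(other.words):
            out.words[i] |= w
        return out
```

Because words are allocated in 64-bit chunks, sizing by word count can only over-allocate by rounding; sizing by numBits keeps the logical range of the union exact.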
[jira] [Updated] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex updated SPARK-2344: Description: I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib. FCM is very similar to K-Means, which is already implemented; they differ only in the degree of relationship each point has with each cluster (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1). As part of the implementation I would like to: - create a base class for K-Means and FCM - implement the relationship for each algorithm differently (in its class) I'd like this to be assigned to me. was: I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib. FCM is very similar to K - Means which is already implemented, and they differ only in the degree of relationship each point has with each cluster: (in FCM the relationship is in a range of [0..1] whether in K - Means its 0/1. As part of the implementation I would like: - create a base class for K- Means and FCM - implement the relationship for each algorithm differently (in its class) Priority: Minor (was: Major) Affects Version/s: (was: 1.0.0) > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add a FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K-Means, which is already implemented; they > differ only in the degree of relationship each point has with each cluster > (in FCM the relationship is in the range [0..1], whereas in K-Means it is 0/1). > As part of the implementation I would like to: > - create a base class for K-Means and FCM > - implement the relationship for each algorithm differently (in its class) > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
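The [0..1] relationship the proposal describes is the standard FCM membership, u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)), where d_ij is the distance from point i to center j and m > 1 is the fuzzifier. A minimal pure-Python sketch of that update (hypothetical names, not the proposed MLlib code):

```python
def fcm_memberships(points, centers, m=2.0, eps=1e-9):
    # Returns one membership row per point; each row sums to 1, unlike
    # k-means' hard 0/1 assignment. eps guards against division by zero
    # when a point coincides with a center.
    out = []
    for p in points:
        d = [max(eps, sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5)
             for c in centers]
        row = []
        for j in range(len(centers)):
            s = sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                    for k in range(len(centers)))
            row.append(1.0 / s)
        out.append(row)
    return out
```

A point equidistant from two centers gets memberships of 0.5 each, while a point sitting on a center gets membership close to 1 there, which is the behavior that distinguishes FCM from the existing hard K-Means.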
[jira] [Commented] (SPARK-2277) Make TaskScheduler track whether there's host on a rack
[ https://issues.apache.org/jira/browse/SPARK-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051575#comment-14051575 ] Mridul Muralidharan commented on SPARK-2277: I have not rechecked the code, but the way it was originally written by me was: a) Task preference is decoupled from availability of the node. For example, we need not have an executor on a host for which a block has host preference (for example, dfs blocks on a shared cluster). Also note that a block might have one or more preferred locations. b) We look up the rack for the preferred location to get the preferred rack. As with (a), there need not be an executor on that rack. This is just the rack preference. c) At schedule time, for an executor, we look up the host/rack of the executor's location - and decide appropriately based on that. In this context, I think your requirement is already handled. Even if we don't have any hosts alive on a rack, those tasks would still be marked with a rack-local preference in the task set manager. When an executor comes in (existing or new), we check that executor's rack against the task preference - and it would now be marked rack local. > Make TaskScheduler track whether there's host on a rack > --- > > Key: SPARK-2277 > URL: https://issues.apache.org/jira/browse/SPARK-2277 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Rui Li > > When TaskSetManager adds a pending task, it checks whether the tasks's > preferred location is available. Regarding RACK_LOCAL task, we consider the > preferred rack available if such a rack is defined for the preferred host. > This is incorrect as there may be no alive hosts on that rack at all. > Therefore, TaskScheduler should track the hosts on each rack, and provides an > API for TaskSetManager to check if there's host alive on a specific rack. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2339) SQL parser in sql-core is case sensitive, but a table alias is converted to lower case when we create Subquery
[ https://issues.apache.org/jira/browse/SPARK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14050721#comment-14050721 ] Yin Huai edited comment on SPARK-2339 at 7/3/14 1:46 PM: - Also, names of those registered tables are case-sensitive. But, names of Hive tables are case-insensitive. It may cause confusion when a user is using HiveContext. I guess we want to keep registered tables case-sensitive. I will add docs to registerAsTable and registerRDDAsTable. was (Author: yhuai): Also, names of those registered tables are case sensitive. But, names of Hive tables are case insensitive. It will cause confusion when a user using HiveContext. I think it may be good to treat all identifiers case insensitive when a user is using HiveContext and make HiveContext.sql as a alias of HiveContext.hql (basically do not expose catalyst's SQLParser in HiveContext). > SQL parser in sql-core is case sensitive, but a table alias is converted to > lower case when we create Subquery > -- > > Key: SPARK-2339 > URL: https://issues.apache.org/jira/browse/SPARK-2339 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Yin Huai > Fix For: 1.1.0 > > > Reported by > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-throws-exception-td8599.html > After we get the table from the catalog, because the table has an alias, we > will temporarily insert a Subquery. Then, we convert the table alias to lower > case no matter if the parser is case sensitive or not. > To see the issue ... > {code} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.createSchemaRDD > case class Person(name: String, age: Int) > val people = > sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p > => Person(p(0), p(1).trim.toInt)) > people.registerAsTable("people") > sqlContext.sql("select PEOPLE.name from people PEOPLE") > {code} > The plan is ... 
> {code} > == Query Plan == > Project ['PEOPLE.name] > ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at > basicOperators.scala:176 > {code} > You can see that "PEOPLE.name" is not resolved. -- This message was sent by Atlassian JIRA (v6.2#6252)
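The mechanics of the bug above can be illustrated outside Spark. This is a minimal Python sketch (not Spark code; `Catalog`, `register_alias`, and `resolve_attribute` are hypothetical names): a case-sensitive attribute lookup combined with a registration path that unconditionally lowercases the alias reproduces exactly the "unresolved" failure in the plan.

```python
# Illustrative sketch of SPARK-2339: the alias is lowercased when the
# Subquery is inserted, but attribute resolution stays case-sensitive,
# so the mixed-case alias in the query can no longer be found.

class Catalog:
    def __init__(self):
        self.aliases = {}  # alias -> list of column names

    def register_alias(self, alias, columns):
        # Mirrors the bug: the alias is lowercased unconditionally,
        # even though the parser is case sensitive.
        self.aliases[alias.lower()] = columns

    def resolve_attribute(self, qualified_name):
        alias, column = qualified_name.split(".")
        cols = self.aliases.get(alias)  # case-sensitive lookup
        if cols is None or column not in cols:
            return None  # unresolved, like 'PEOPLE.name in the plan above
        return (alias, column)

catalog = Catalog()
catalog.register_alias("PEOPLE", ["name", "age"])        # stored as "people"
assert catalog.resolve_attribute("PEOPLE.name") is None  # unresolved
assert catalog.resolve_attribute("people.name") == ("people", "name")
```

Either the alias must be stored with its original case or the lookup must match the parser's case sensitivity; doing one but not the other produces the mismatch.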
[jira] [Commented] (SPARK-2330) Spark shell has weird scala semantics
[ https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051298#comment-14051298 ] Sean Owen commented on SPARK-2330: -- Hm. Yes I have the same error now. I don't know why I didn't before as I am pretty sure I just copied and pasted this code exactly as given. > Spark shell has weird scala semantics > - > > Key: SPARK-2330 > URL: https://issues.apache.org/jira/browse/SPARK-2330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.1, 1.0.0 > Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2 >Reporter: Andrea Ferretti > Labels: scala, shell > > Normal scala expressions are interpreted in a strange way in the spark shell. > For instance > {noformat} > case class Foo(x: Int) > def print(f: Foo) = f.x > val f = Foo(3) > print(f) > :24: error: type mismatch; > found : Foo > required: Foo > {noformat} > For another example > {noformat} > trait Currency > case object EUR extends Currency > case object USD extends Currency > def nextCurrency: Currency = nextInt(2) match { > case 0 => EUR > case _ => USD > } > :22: error: type mismatch; > found : EUR.type > required: Currency > case 0 => EUR > :24: error: type mismatch; > found : USD.type > required: Currency > case _ => USD > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1199) Type mismatch in Spark shell when using case class defined in shell
[ https://issues.apache.org/jira/browse/SPARK-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051282#comment-14051282 ] Andrea Ferretti commented on SPARK-1199: More examples on https://issues.apache.org/jira/browse/SPARK-2330 which should also be a duplicate > Type mismatch in Spark shell when using case class defined in shell > --- > > Key: SPARK-1199 > URL: https://issues.apache.org/jira/browse/SPARK-1199 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.0 >Reporter: Andrew Kerr >Assignee: Prashant Sharma >Priority: Blocker > > Define a class in the shell: > {code} > case class TestClass(a:String) > {code} > and an RDD > {code} > val data = sc.parallelize(Seq("a")).map(TestClass(_)) > {code} > define a function on it and map over the RDD > {code} > def itemFunc(a:TestClass):TestClass = a > data.map(itemFunc) > {code} > Error: > {code} > :19: error: type mismatch; > found : TestClass => TestClass > required: TestClass => ? > data.map(itemFunc) > {code} > Similarly with a mapPartitions: > {code} > def partitionFunc(a:Iterator[TestClass]):Iterator[TestClass] = a > data.mapPartitions(partitionFunc) > {code} > {code} > :19: error: type mismatch; > found : Iterator[TestClass] => Iterator[TestClass] > required: Iterator[TestClass] => Iterator[?] > Error occurred in an application involving default arguments. > data.mapPartitions(partitionFunc) > {code} > The behavior is the same whether in local mode or on a cluster. > This isn't specific to RDDs. A Scala collection in the Spark shell has the > same problem. > {code} > scala> Seq(TestClass("foo")).map(itemFunc) > :15: error: type mismatch; > found : TestClass => TestClass > required: TestClass => ? > Seq(TestClass("foo")).map(itemFunc) > ^ > {code} > When run in the Scala console (not the Spark shell) there are no type > mismatch errors. -- This message was sent by Atlassian JIRA (v6.2#6252)
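The "found: TestClass, required: TestClass" errors in this ticket and SPARK-2330 come from the REPL wrapping each input line in a fresh enclosing scope, so two occurrences of the same class name can denote two distinct types. A small Python analogy (not the Spark shell itself; `define_test_class` is a hypothetical helper) shows the same effect of nominal types defined in separate scopes:

```python
# Illustrative analogy: a class defined inside a function body is a new
# type object on every call, even though its name is unchanged -- just as
# the Spark shell's line wrappers can make "TestClass" mean two different
# types on two different lines.

def define_test_class():
    class TestClass:
        def __init__(self, a):
            self.a = a
    return TestClass

First = define_test_class()
Second = define_test_class()

assert First.__name__ == Second.__name__ == "TestClass"
assert First is not Second                  # same name, distinct types
assert not isinstance(First("x"), Second)   # values of one are not the other
```

This is why the error message looks self-contradictory: the two `TestClass` types print identically but are not the same type to the compiler.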
[jira] [Commented] (SPARK-2330) Spark shell has weird scala semantics
[ https://issues.apache.org/jira/browse/SPARK-2330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051280#comment-14051280 ] Andrea Ferretti commented on SPARK-2330: I have tried this by pulling from git and I still have the same issues. I am not sure why you cannot reproduce it. What do you get when you open a shell and paste the lines above? In any case, it does seem to be a duplicate of the issue you linked. > Spark shell has weird scala semantics > - > > Key: SPARK-2330 > URL: https://issues.apache.org/jira/browse/SPARK-2330 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.9.1, 1.0.0 > Environment: Ubuntu 14.04 with spark-x.x.x-bin-hadoop2 >Reporter: Andrea Ferretti > Labels: scala, shell > > Normal scala expressions are interpreted in a strange way in the spark shell. > For instance > {noformat} > case class Foo(x: Int) > def print(f: Foo) = f.x > val f = Foo(3) > print(f) > :24: error: type mismatch; > found : Foo > required: Foo > {noformat} > For another example > {noformat} > trait Currency > case object EUR extends Currency > case object USD extends Currency > def nextCurrency: Currency = nextInt(2) match { > case 0 => EUR > case _ => USD > } > :22: error: type mismatch; > found : EUR.type > required: Currency > case 0 => EUR > :24: error: type mismatch; > found : USD.type > required: Currency > case _ => USD > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051194#comment-14051194 ] Sean Owen commented on SPARK-2341: -- [~mengxr] For regression, rather than further overloading "multiclass" to mean "regression", how about modifying the argument to take on three values (as an enum, string, etc.) to distinguish the three modes? The current method would stay, but be deprecated. multiclass=false is for binary classification. libsvm uses "0" and "1" (or any ints) for binary classification. But this parses it as a real number, and rounds to 0/1. (Is that what libsvm does?) Maybe it's a convenient semantic overload when you want to transform a continuous value to a 0/1 indicator, but is that implied by libsvm format or just a transformation the caller should make? multiclass=true treats libsvm integer labels as doubles, but not continuous values. Having this mode also double as the mode that parses continuous-valued labels as continuous values seems to invite more confusion. libsvm is widely used but it's old; I don't think its file format from long ago should necessarily inform API design now. There are other serializations besides libsvm (plain CSV for instance) and other algorithms (random decision forests). You can make utilities to convert classes to numbers for the benefit of the implementation up front, and I'll have to in order to use this. Maybe we can start there -- at least if a utility is in the project people aren't all reinventing this in order to use an SVM with actual labels. The caller carries around a dictionary then to do the reverse mapping. The model seems like the place to hold that info, if in fact internally it converts classes to some other representation. Maybe the need would be clearer once the utility is created. 
As you say I'm concerned that the API is already locked down early and some of these changes are going to be viewed as infeasible just for that reason. > loadLibSVMFile doesn't handle regression datasets > - > > Key: SPARK-2341 > URL: https://issues.apache.org/jira/browse/SPARK-2341 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Eustache >Priority: Minor > Labels: easyfix > > Many datasets exist in LibSVM format for regression tasks [1] but currently > the loadLibSVMFile primitive doesn't handle regression datasets. > More precisely, the LabelParser is either a MulticlassLabelParser or a > BinaryLabelParser. What happens then is that the file is loaded but in > multiclass mode : each target value is interpreted as a class name ! > The fix would be to write a RegressionLabelParser which converts target > values to Double and plug it into the loadLibSVMFile routine. > [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
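The three-mode argument proposed above can be sketched concisely. This is an illustrative Python sketch, not MLlib code: the enum values `BINARY`, `MULTICLASS`, `REGRESSION` and the function `parse_label` are hypothetical names chosen for the example.

```python
# Sketch of the three-mode label parsing proposed in the comment above:
# one explicit mode per interpretation, instead of overloading a boolean.
from enum import Enum

class LabelMode(Enum):
    BINARY = "binary"
    MULTICLASS = "multiclass"
    REGRESSION = "regression"

def parse_label(raw, mode):
    value = float(raw)
    if mode is LabelMode.BINARY:
        return 1.0 if value > 0.5 else 0.0  # round to a 0/1 indicator
    if mode is LabelMode.MULTICLASS:
        return float(int(value))            # integer class id, stored as double
    return value                            # REGRESSION: keep the continuous value

assert parse_label("1", LabelMode.BINARY) == 1.0
assert parse_label("3", LabelMode.MULTICLASS) == 3.0
assert parse_label("2.75", LabelMode.REGRESSION) == 2.75
```

With an explicit mode, the regression case no longer has to masquerade as "multiclass", and a deprecated boolean overload can delegate to the enum form during a transition period.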
[jira] [Created] (SPARK-2353) ArrayIndexOutOfBoundsException in scheduler
Mridul Muralidharan created SPARK-2353: -- Summary: ArrayIndexOutOfBoundsException in scheduler Key: SPARK-2353 URL: https://issues.apache.org/jira/browse/SPARK-2353 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Mridul Muralidharan Priority: Blocker I suspect the recent changes from SPARK-1937 to compute valid locality levels (and ignore ones which are not applicable) have resulted in this issue. Specifically, some of the code using currentLocalityIndex (and lastLaunchTime actually) seems to be assuming a) constant population of locality levels. b) probably also immutability/repeatability of locality levels. These do not hold any longer. I do not have the exact values for which this failure was observed (since this is from the logs of a failed job) - but the code path is highly suspect. Also note that the line numbers/classes might not exactly match master since we are in the middle of a merge. But the issue should hopefully be evident. java.lang.ArrayIndexOutOfBoundsException: 2 at org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:439) at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:388) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:248) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:244) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$5.apply(TaskSchedulerImpl.scala:241) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241) at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:241) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:241) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:133) at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:86) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) -- This message was sent by Atlassian JIRA (v6.2#6252)
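The failure mode described above can be sketched in a few lines. This is an illustrative Python sketch, not the scheduler code: `levels`, `current_locality_index`, and `allowed_level` are hypothetical names standing in for the locality-level array and the saved index.

```python
# Sketch of SPARK-2353: an index saved against one computation of the
# locality levels is reused after the levels are recomputed and shrink,
# which raises the out-of-bounds error; a defensive lookup clamps the
# saved index into the current list instead of trusting it.

levels = ["PROCESS_LOCAL", "NODE_LOCAL", "ANY"]
current_locality_index = 2             # valid while there are three levels

levels = ["PROCESS_LOCAL", "ANY"]      # levels recomputed; the list shrank

try:
    levels[current_locality_index]     # analogous to ArrayIndexOutOfBoundsException: 2
except IndexError:
    pass

def allowed_level(levels, index):
    # Defensive variant: clamp the stale index into the current list.
    return levels[min(index, len(levels) - 1)]

assert allowed_level(levels, current_locality_index) == "ANY"
```

The underlying fix is to stop assuming the level set is constant: either recompute the index whenever the levels change, or validate it on every access as above.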
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051151#comment-14051151 ] Xiangrui Meng commented on SPARK-2341: -- [~srowen] Instead of taking string labels directly, we can provide tools to convert them to integer labels (still Double typed). LIBLINEAR/LIBSVM do not support string labels either, but they are still among the top choices for logistic regression and SVM. [~eustache] Unfortunately, the argument name in Scala is part of the API and loadLibSVMFile is not marked as experimental. So we cannot update the argument name to `multiclassOrRegression`, which is too long anyway. Could you update the doc and change the first sentence from "multiclass: whether the input labels contain more than two classes" to "multiclass: whether the input labels are continuous-valued (for regression) or contain more than two classes"? > loadLibSVMFile doesn't handle regression datasets > - > > Key: SPARK-2341 > URL: https://issues.apache.org/jira/browse/SPARK-2341 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Eustache >Priority: Minor > Labels: easyfix > > Many datasets exist in LibSVM format for regression tasks [1] but currently > the loadLibSVMFile primitive doesn't handle regression datasets. > More precisely, the LabelParser is either a MulticlassLabelParser or a > BinaryLabelParser. What happens then is that the file is loaded but in > multiclass mode : each target value is interpreted as a class name ! > The fix would be to write a RegressionLabelParser which converts target > values to Double and plug it into the loadLibSVMFile routine. > [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
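The conversion utility discussed in these comments can be sketched briefly. This is an illustrative Python sketch, not an MLlib API: `encode_labels` is a hypothetical name, and the reverse mapping is what the comments suggest a model would hold to translate predictions back.

```python
# Sketch of the suggested utility: map arbitrary string labels to integer
# ids (stored as floats, matching the Double-typed labels discussed above)
# and keep the reverse mapping for decoding predictions.

def encode_labels(labels):
    index = {}    # string label -> float id, assigned in order of first appearance
    encoded = []
    for label in labels:
        if label not in index:
            index[label] = float(len(index))
        encoded.append(index[label])
    reverse = {v: k for k, v in index.items()}
    return encoded, reverse

encoded, reverse = encode_labels(["spam", "ham", "spam"])
assert encoded == [0.0, 1.0, 0.0]
assert reverse[0.0] == "spam" and reverse[1.0] == "ham"
```

Keeping the reverse dictionary alongside the model, as proposed above, means callers never have to reinvent the mapping to use an SVM or logistic regression with their actual labels.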