[jira] [Updated] (SPARK-1441) compile Spark Core error with Hadoop 0.23.x
[ https://issues.apache.org/jira/browse/SPARK-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

witgo updated SPARK-1441:
-------------------------
    Summary: compile Spark Core error with Hadoop 0.23.x  (was: Spark Core build error with Hadoop 0.23.x)

> compile Spark Core error with Hadoop 0.23.x
> -------------------------------------------
>
>                 Key: SPARK-1441
>                 URL: https://issues.apache.org/jira/browse/SPARK-1441
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.0.0
>            Reporter: witgo
>         Attachments: mvn.log, sbt.log
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-1428) MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
[ https://issues.apache.org/jira/browse/SPARK-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962636#comment-13962636 ]

Sandeep Singh commented on SPARK-1428:
--------------------------------------

This should work: https://github.com/apache/spark/pull/356

> MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-1428
>                 URL: https://issues.apache.org/jira/browse/SPARK-1428
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>            Reporter: Matei Zaharia
>            Priority: Minor
>              Labels: Starter
>
> Pretty easy to fix, and it would avoid spewing some scary task-failed errors. The place to fix this is _serialize_double_vector in _common.py.
>
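The fix the issue asks for lives in `_serialize_double_vector` in `_common.py`. As a rough, hedged sketch of the coercion idea, here is a plain-Python stand-in (the real code serializes NumPy arrays; this uses `struct` so the principle is visible without NumPy, and the function name only mirrors, not reproduces, Spark's helper):

```python
import struct

def serialize_double_vector(vec):
    """Pack a numeric sequence as IEEE-754 doubles. Instead of rejecting
    non-float input (the behavior this issue complains about), coerce each
    element to float first; integers upcast losslessly up to 2**53."""
    doubles = [float(x) for x in vec]  # the proposed coercion step
    return struct.pack("<%dd" % len(doubles), *doubles)

payload = serialize_double_vector([1, 2, 3])   # int input now accepted
restored = struct.unpack("<3d", payload)
print(restored)  # (1.0, 2.0, 3.0)
```

The same one-line coercion (`astype(np.float64)` in the NumPy case) is what avoids the task-failed errors mentioned in the description.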
[jira] [Updated] (SPARK-1441) Spark Core build error with Hadoop 0.23.x
[ https://issues.apache.org/jira/browse/SPARK-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

witgo updated SPARK-1441:
-------------------------
    Summary: Spark Core build error with Hadoop 0.23.x  (was: Spark Core with Hadoop 0.23.X error)

> Spark Core build error with Hadoop 0.23.x
> -----------------------------------------
>
>                 Key: SPARK-1441
>                 URL: https://issues.apache.org/jira/browse/SPARK-1441
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.0.0
>            Reporter: witgo
>         Attachments: mvn.log, sbt.log
>
[jira] [Resolved] (SPARK-1103) Garbage collect RDD information inside of Spark
[ https://issues.apache.org/jira/browse/SPARK-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-1103.
------------------------------------
    Resolution: Fixed

> Garbage collect RDD information inside of Spark
> -----------------------------------------------
>
>                 Key: SPARK-1103
>                 URL: https://issues.apache.org/jira/browse/SPARK-1103
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Tathagata Das
>            Priority: Blocker
>             Fix For: 1.0.0
>
> When Spark jobs run for a long period of time, state accumulates. This is dealt with now using TTL-based cleaning. Instead we should do proper garbage collection using weak references.
>
[jira] [Commented] (SPARK-1436) Compression code broke in-memory store
[ https://issues.apache.org/jira/browse/SPARK-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962615#comment-13962615 ]

Cheng Lian commented on SPARK-1436:
-----------------------------------

Fixed in [this commit|https://github.com/liancheng/spark/commit/1d037b83191099da961c247a57ef686cb508c447] of PR [#330|https://github.com/apache/spark/pull/330]

> Compression code broke in-memory store
> --------------------------------------
>
>                 Key: SPARK-1436
>                 URL: https://issues.apache.org/jira/browse/SPARK-1436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.0
>            Reporter: Reynold Xin
>            Assignee: Cheng Lian
>            Priority: Blocker
>             Fix For: 1.0.0
>
> See my following comment...
>
[jira] [Commented] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962601#comment-13962601 ]

Sandeep Singh commented on SPARK-1433:
--------------------------------------

Pull request: https://github.com/apache/spark/pull/355

> Upgrade Mesos dependency to 0.17.0
> ----------------------------------
>
>                 Key: SPARK-1433
>                 URL: https://issues.apache.org/jira/browse/SPARK-1433
>             Project: Spark
>          Issue Type: Task
>            Reporter: Sandeep Singh
>            Priority: Minor
>
> Mesos 0.13.0 was released 6 months ago.
> Upgrade Mesos dependency to 0.17.0
>
[jira] [Commented] (SPARK-1415) Add a minSplits parameter to wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962585#comment-13962585 ]

Matei Zaharia commented on SPARK-1415:
--------------------------------------

Hey Xusen, that makes sense. I think that for consistency with our other API methods, we should add minSplits here, and we can compute maxSplitSize from it. Later on we can have versions of the methods that take a maxSplitSize. But on the old Hadoop API, for example, we can't easily change this, and a maxSplitSize is always possible to compute from minSplits.

> Add a minSplits parameter to wholeTextFiles
> -------------------------------------------
>
>                 Key: SPARK-1415
>                 URL: https://issues.apache.org/jira/browse/SPARK-1415
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Matei Zaharia
>            Assignee: Xusen Yin
>              Labels: Starter
>
> This probably requires adding one to newAPIHadoopFile too.
>
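Matei's point that a maxSplitSize is always computable from minSplits comes down to simple division over the total input size. A hypothetical helper sketching that arithmetic (`max_split_size` is illustrative, not a Spark API; Spark's actual wholeTextFiles code may differ):

```python
def max_split_size(total_size_bytes, min_splits):
    """Largest split size that still yields at least `min_splits` input
    splits: capping each split at total/min_splits guarantees the split
    count is >= min_splits. Floor division; minimum split size of 1 byte."""
    if min_splits <= 0:
        raise ValueError("min_splits must be positive")
    return max(1, total_size_bytes // min_splits)

# 1 GB of input with minSplits=4 caps each split at 256 MB.
print(max_split_size(1 << 30, 4))  # 268435456
```

This is why exposing minSplits loses nothing: the maxSplitSize-style API can be layered on later, as the comment suggests.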
[jira] [Commented] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962581#comment-13962581 ]

Sandeep Singh commented on SPARK-1433:
--------------------------------------

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [2.002s]
[INFO] Spark Project Core ................................ SUCCESS [30.635s]
[INFO] Spark Project Bagel ............................... SUCCESS [0.883s]
[INFO] Spark Project GraphX .............................. SUCCESS [0.829s]
[INFO] Spark Project ML Library .......................... SUCCESS [0.805s]
[INFO] Spark Project Streaming ........................... SUCCESS [0.911s]
[INFO] Spark Project Tools ............................... SUCCESS [0.645s]
[INFO] Spark Project Catalyst ............................ SUCCESS [0.897s]
[INFO] Spark Project SQL ................................. SUCCESS [1.193s]
[INFO] Spark Project Hive ................................ SUCCESS [1.541s]
[INFO] Spark Project REPL ................................ SUCCESS [1.164s]
[INFO] Spark Project Assembly ............................ SUCCESS [1.729s]
[INFO] Spark Project External Twitter .................... SUCCESS [0.809s]
[INFO] Spark Project External Kafka ...................... SUCCESS [0.591s]
[INFO] Spark Project External Flume ...................... SUCCESS [0.696s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [0.484s]
[INFO] Spark Project External MQTT ....................... SUCCESS [0.543s]
[INFO] Spark Project Examples ............................ SUCCESS [2.385s]

> Upgrade Mesos dependency to 0.17.0
> ----------------------------------
>
>                 Key: SPARK-1433
>                 URL: https://issues.apache.org/jira/browse/SPARK-1433
>             Project: Spark
>          Issue Type: Task
>            Reporter: Sandeep Singh
>            Priority: Minor
>
> Mesos 0.13.0 was released 6 months ago.
> Upgrade Mesos dependency to 0.17.0
>
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962563#comment-13962563 ]

Shivaram Venkataraman commented on SPARK-1391:
----------------------------------------------

Sorry, didn't get a chance to try this yet. Will try to do it tomorrow.

> BlockManager cannot transfer blocks larger than 2G in size
> ----------------------------------------------------------
>
>                 Key: SPARK-1391
>                 URL: https://issues.apache.org/jira/browse/SPARK-1391
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Shuffle
>    Affects Versions: 1.0.0
>            Reporter: Shivaram Venkataraman
>            Assignee: Min Zhou
>         Attachments: SPARK-1391.diff
>
> If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message
> java.lang.ArrayIndexOutOfBoundsException
>         at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
>         at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
>         at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
>         at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
>         at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
>         at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
>         at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
>         at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
>         at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
>         at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
>         at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
>         at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>         at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>         at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
>         at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
>
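The ArrayIndexOutOfBoundsException above is consistent with Java byte arrays and buffers being indexed by signed 32-bit ints: once a block's position passes 2^31 - 1 bytes (~2 GB), the index wraps negative. A small Python illustration of that wraparound (illustrative only; the limit lives in the JVM's array model, and `to_signed_int32` is a stand-in for Java's int arithmetic):

```python
def to_signed_int32(n):
    """Wrap an integer the way Java's signed 32-bit arithmetic would;
    shows why a buffer position past 2 GB turns into a negative 'index'
    inside classes like FastByteArrayOutputStream."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n

limit = (1 << 31) - 1              # Integer.MAX_VALUE: 2147483647 (~2 GB)
print(to_signed_int32(limit))      # 2147483647, still a valid index
print(to_signed_int32(limit + 1))  # -2147483648, out of bounds
```

This is why fixing the bug requires chunking transfers rather than growing a single array or ByteBuffer.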
[jira] [Updated] (SPARK-1441) Spark Core with Hadoop 0.23.X error
[ https://issues.apache.org/jira/browse/SPARK-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

witgo updated SPARK-1441:
-------------------------
    Attachment: mvn.log
                sbt.log

{code}
./make-distribution.sh --hadoop 0.23.9 > sbt.log
mvn -Dhadoop.version=0.23.9 -DskipTests package -X > mvn.log
{code}

> Spark Core with Hadoop 0.23.X error
> -----------------------------------
>
>                 Key: SPARK-1441
>                 URL: https://issues.apache.org/jira/browse/SPARK-1441
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.0.0
>            Reporter: witgo
>         Attachments: mvn.log, sbt.log
>
[jira] [Updated] (SPARK-1441) Spark Core with Hadoop 0.23.X error
[ https://issues.apache.org/jira/browse/SPARK-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

witgo updated SPARK-1441:
-------------------------
    Summary: Spark Core with Hadoop 0.23.X error  (was: build with Hadoop 0.23.X error)

> Spark Core with Hadoop 0.23.X error
> -----------------------------------
>
>                 Key: SPARK-1441
>                 URL: https://issues.apache.org/jira/browse/SPARK-1441
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 1.0.0
>            Reporter: witgo
>
[jira] [Created] (SPARK-1441) build with Hadoop 0.23.X error
witgo created SPARK-1441:
----------------------------

             Summary: build with Hadoop 0.23.X error
                 Key: SPARK-1441
                 URL: https://issues.apache.org/jira/browse/SPARK-1441
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 1.0.0
            Reporter: witgo
[jira] [Updated] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandeep Singh updated SPARK-1433:
---------------------------------
    Description:
Mesos 0.13.0 was released 6 months ago.
Upgrade Mesos dependency to 0.17.0

  was:
Mesos 0.14.0 was released 6 months ago.
Upgrade Mesos dependency to 0.17.0

> Upgrade Mesos dependency to 0.17.0
> ----------------------------------
>
>                 Key: SPARK-1433
>                 URL: https://issues.apache.org/jira/browse/SPARK-1433
>             Project: Spark
>          Issue Type: Task
>            Reporter: Sandeep Singh
>            Priority: Minor
>
> Mesos 0.13.0 was released 6 months ago.
> Upgrade Mesos dependency to 0.17.0
>
[jira] [Commented] (SPARK-1424) InsertInto should work on JavaSchemaRDD as well.
[ https://issues.apache.org/jira/browse/SPARK-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962533#comment-13962533 ]

Michael Armbrust commented on SPARK-1424:
-----------------------------------------

Started on this here: https://github.com/apache/spark/pull/354

A few things: there is no way to createTableAs from a standard SQL context, as I'm not really sure where to put the files. Also, it might be nice to have a "create table" that doesn't fail if the table exists but instead appends to it. That is going to require some minor tweaking in the execution engine, though, whereas the above options were just API extensions.

> InsertInto should work on JavaSchemaRDD as well.
> ------------------------------------------------
>
>                 Key: SPARK-1424
>                 URL: https://issues.apache.org/jira/browse/SPARK-1424
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.0.0
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>            Priority: Blocker
>
[jira] [Comment Edited] (SPARK-1436) Compression code broke in-memory store
[ https://issues.apache.org/jira/browse/SPARK-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962517#comment-13962517 ]

Cheng Lian edited comment on SPARK-1436 at 4/8/14 2:10 AM:
-----------------------------------------------------------

Sorry, forgot to duplicate the in-memory column byte buffer when creating new {{ColumnAccessor}}'s, so that when the column byte buffer is accessed multiple times, the position is not reset to 0. Will fix this in PR [#330|https://github.com/apache/spark/pull/330] with regression test.

  was (Author: lian cheng):
Sorry, forgot to duplicate the in-memory column byte buffer when creating new {{ColumnAccessor}}s, so that when the column byte buffer is accessed multiple times, the position is not reset to 0. Will fix this in PR [#330|https://github.com/apache/spark/pull/330] with regression test.

> Compression code broke in-memory store
> --------------------------------------
>
>                 Key: SPARK-1436
>                 URL: https://issues.apache.org/jira/browse/SPARK-1436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.0
>            Reporter: Reynold Xin
>            Assignee: Cheng Lian
>            Priority: Blocker
>             Fix For: 1.0.0
>
> See my following comment...
>
[jira] [Commented] (SPARK-1436) Compression code broke in-memory store
[ https://issues.apache.org/jira/browse/SPARK-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962517#comment-13962517 ]

Cheng Lian commented on SPARK-1436:
-----------------------------------

Sorry, forgot to duplicate the in-memory column byte buffer when creating new {{ColumnAccessor}}s, so that when the column byte buffer is accessed multiple times, the position is not reset to 0. Will fix this in PR [#330|https://github.com/apache/spark/pull/330] with regression test.

> Compression code broke in-memory store
> --------------------------------------
>
>                 Key: SPARK-1436
>                 URL: https://issues.apache.org/jira/browse/SPARK-1436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.0
>            Reporter: Reynold Xin
>            Assignee: Cheng Lian
>            Priority: Blocker
>             Fix For: 1.0.0
>
> See my following comment...
>
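The fix Cheng Lian describes is to duplicate the underlying buffer so each {{ColumnAccessor}} gets an independent read position. A rough Python analogy using `io.BytesIO` in place of `java.nio.ByteBuffer` (illustrative only, not the actual Spark SQL code; `fresh_accessor` stands in for `ByteBuffer.duplicate()`):

```python
import io

data = b"column-bytes"

# Bug analog: two accessors share one stream, so the second read sees
# nothing because the first moved the position to the end and it is
# never reset to 0.
shared = io.BytesIO(data)
first = shared.read()
second = shared.read()
print(first, second)  # b'column-bytes' b''

# Fix analog of ByteBuffer.duplicate(): give each accessor its own view,
# with an independent position, over the same underlying bytes.
def fresh_accessor(buf):
    return io.BytesIO(buf.getvalue())

a = fresh_accessor(shared).read()
b = fresh_accessor(shared).read()
print(a == b == data)  # True
```

`ByteBuffer.duplicate()` shares storage rather than copying it, which is why it is the cheap fix here; the regression test in PR #330 presumably reads the same column twice.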
[jira] [Created] (SPARK-1439) Aggregate Scaladocs across projects
Matei Zaharia created SPARK-1439:
------------------------------------

             Summary: Aggregate Scaladocs across projects
                 Key: SPARK-1439
                 URL: https://issues.apache.org/jira/browse/SPARK-1439
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation
            Reporter: Matei Zaharia
             Fix For: 1.0.0


Apparently there's a "Unidoc" plugin to put together ScalaDocs across modules: https://github.com/akka/akka/blob/master/project/Unidoc.scala
[jira] [Created] (SPARK-1440) Generate JavaDoc instead of ScalaDoc for Java API
Matei Zaharia created SPARK-1440:
------------------------------------

             Summary: Generate JavaDoc instead of ScalaDoc for Java API
                 Key: SPARK-1440
                 URL: https://issues.apache.org/jira/browse/SPARK-1440
             Project: Spark
          Issue Type: Sub-task
          Components: Documentation
            Reporter: Matei Zaharia
             Fix For: 1.0.0


It may be possible to use this plugin: https://github.com/typesafehub/genjavadoc
[jira] [Updated] (SPARK-1351) Documentation Improvements for Spark 1.0
[ https://issues.apache.org/jira/browse/SPARK-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-1351:
-----------------------------------
    Description:
Umbrella to track necessary doc improvements. We can break these out into other JIRA's over time.
- Use grouping in the RDD and SparkContext scaladocs. See Schema RDD: http://people.apache.org/~pwendell/catalyst-docs/api/sql/core/index.html#org.apache.spark.sql.SchemaRDD
- Use spark-submit script wherever possible in docs.
- Have package-level documentation in Scaladoc. Also these can be grouped so that the o.a.s package doc looks nice.

  was:
Umbrella to track necessary doc improvements. We can break these out into other JIRA's over time.
- Use grouping in the RDD and SparkContext scaladocs. See Schema RDD: http://people.apache.org/~pwendell/catalyst-docs/api/sql/core/index.html#org.apache.spark.sql.SchemaRDD
- Use spark-submit script wherever possible in docs.
- Have package-level documentation in Scaladoc.

> Documentation Improvements for Spark 1.0
> ----------------------------------------
>
>                 Key: SPARK-1351
>                 URL: https://issues.apache.org/jira/browse/SPARK-1351
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>            Reporter: Patrick Wendell
>            Priority: Critical
>             Fix For: 1.0.0
>
> Umbrella to track necessary doc improvements. We can break these out into other JIRA's over time.
> - Use grouping in the RDD and SparkContext scaladocs. See Schema RDD: http://people.apache.org/~pwendell/catalyst-docs/api/sql/core/index.html#org.apache.spark.sql.SchemaRDD
> - Use spark-submit script wherever possible in docs.
> - Have package-level documentation in Scaladoc. Also these can be grouped so that the o.a.s package doc looks nice.
>
[jira] [Resolved] (SPARK-1099) Allow inferring number of cores with local[*]
[ https://issues.apache.org/jira/browse/SPARK-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aaron Davidson resolved SPARK-1099.
-----------------------------------
    Resolution: Fixed

> Allow inferring number of cores with local[*]
> ---------------------------------------------
>
>                 Key: SPARK-1099
>                 URL: https://issues.apache.org/jira/browse/SPARK-1099
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>            Reporter: Aaron Davidson
>            Assignee: Aaron Davidson
>            Priority: Minor
>             Fix For: 1.0.0
>
> It seems reasonable that the default number of cores used by Spark's local mode (when no value is specified) is drawn from the spark.cores.max configuration parameter (which, conveniently, is now settable as a command-line option in spark-shell).
> For the sake of consistency, it's probable that this change would also entail making the default number of cores, when spark.cores.max is NOT specified, be as many logical cores as are on the machine (which is what standalone mode does). This too seems reasonable, as Spark is inherently a distributed system and I think it's expected that it should use multiple cores by default. However, it is a behavioral change, and thus requires caution.
>
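The resolved behavior can be pictured as a tiny parser over the master string. This sketch is hypothetical (`parse_local_master` is not Spark's implementation, which lives in Scala), but it shows the intended semantics: `local[*]` resolves to all logical cores the way standalone mode does:

```python
import os
import re

def parse_local_master(master):
    """Resolve a Spark-style local master string to a core count:
    'local' -> 1, 'local[N]' -> N, 'local[*]' -> all logical cores.
    Illustrative parser only."""
    if master == "local":
        return 1
    m = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if not m:
        raise ValueError("not a local master: %r" % master)
    spec = m.group(1)
    return os.cpu_count() if spec == "*" else int(spec)

print(parse_local_master("local[4]"))  # 4
```

The behavioral caution in the description is about the `local[*]` default: jobs that silently assumed one core now get a machine-dependent degree of parallelism.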
[jira] [Updated] (SPARK-1430) Support sparse data in Python MLlib
[ https://issues.apache.org/jira/browse/SPARK-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated SPARK-1430:
---------------------------------
    Fix Version/s: 1.0.0

> Support sparse data in Python MLlib
> -----------------------------------
>
>                 Key: SPARK-1430
>                 URL: https://issues.apache.org/jira/browse/SPARK-1430
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib, PySpark
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>             Fix For: 1.0.0
>
[jira] [Updated] (SPARK-1438) Update RDD.sample() API to make seed parameter optional
[ https://issues.apache.org/jira/browse/SPARK-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matei Zaharia updated SPARK-1438:
---------------------------------
    Fix Version/s: 1.0.0

> Update RDD.sample() API to make seed parameter optional
> -------------------------------------------------------
>
>                 Key: SPARK-1438
>                 URL: https://issues.apache.org/jira/browse/SPARK-1438
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Matei Zaharia
>            Priority: Blocker
>              Labels: Starter
>             Fix For: 1.0.0
>
> When a seed is not given, it should pick one based on Math.random().
> This needs to be done in Java and Python as well.
>
[jira] [Created] (SPARK-1438) Update RDD.sample() API to make seed parameter optional
Matei Zaharia created SPARK-1438:
------------------------------------

             Summary: Update RDD.sample() API to make seed parameter optional
                 Key: SPARK-1438
                 URL: https://issues.apache.org/jira/browse/SPARK-1438
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Matei Zaharia
            Priority: Blocker


When a seed is not given, it should pick one based on Math.random().

This needs to be done in Java and Python as well.
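The requested behavior, drawing a seed from a random source when none is supplied while staying deterministic when one is, can be sketched in a few lines. This `sample` helper is hypothetical (it is not Spark's `RDD.sample()`, which samples partitions lazily); it only demonstrates the optional-seed pattern:

```python
import random

def sample(population, fraction, seed=None):
    """Bernoulli-style sample of `population`. When `seed` is None a
    random one is drawn, mirroring the proposed default; passing the
    returned seed back in reproduces the exact sample."""
    if seed is None:
        seed = random.randrange(2**32)  # pick one when the caller omits it
    rng = random.Random(seed)
    return [x for x in population if rng.random() < fraction], seed

picked, used_seed = sample(range(100), 0.5, seed=42)
same, _ = sample(range(100), 0.5, seed=used_seed)
print(picked == same)  # True
```

Returning (or logging) the seed that was actually used keeps runs reproducible even when the caller relied on the default.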
[jira] [Updated] (SPARK-1437) Jenkins should build with Java 6
[ https://issues.apache.org/jira/browse/SPARK-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-1437:
-----------------------------
    Attachment: Screen Shot 2014-04-07 at 22.53.56.png

> Jenkins should build with Java 6
> --------------------------------
>
>                 Key: SPARK-1437
>                 URL: https://issues.apache.org/jira/browse/SPARK-1437
>             Project: Spark
>          Issue Type: Bug
>          Components: Build
>    Affects Versions: 0.9.0
>            Reporter: Sean Owen
>            Priority: Minor
>              Labels: javac, jenkins
>         Attachments: Screen Shot 2014-04-07 at 22.53.56.png
>
> Apologies if this was already on someone's to-do list, but I wanted to track this, as it bit two commits in the last few weeks.
> Spark is intended to work with Java 6, and so compiles with source/target 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 bytecode. However, unless otherwise configured with -bootclasspath, javac will use its own (Java 7) library classes. This means code that uses classes in Java 7 will be allowed to compile, but the result will fail when run on Java 6.
> This is why you get warnings like ...
> Using /usr/java/jdk1.7.0_51 as default JAVA_HOME.
> ...
> [warn] warning: [options] bootstrap class path not set in conjunction with -source 1.6
> The solution is just to tell Jenkins to use Java 6. This may be stating the obvious, but it should just be a setting under "Configure" for SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not an option already I'm guessing it's not too hard to get Java 6 configured on the Amplab machines.
>
[jira] [Commented] (SPARK-1391) BlockManager cannot transfer blocks larger than 2G in size
[ https://issues.apache.org/jira/browse/SPARK-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962319#comment-13962319 ]

Min Zhou commented on SPARK-1391:
---------------------------------

Any update on your test, [~shivaram]?

> BlockManager cannot transfer blocks larger than 2G in size
> ----------------------------------------------------------
>
>                 Key: SPARK-1391
>                 URL: https://issues.apache.org/jira/browse/SPARK-1391
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Shuffle
>    Affects Versions: 1.0.0
>            Reporter: Shivaram Venkataraman
>            Assignee: Min Zhou
>         Attachments: SPARK-1391.diff
>
> If a task tries to remotely access a cached RDD block, I get an exception when the block size is > 2G. The exception is pasted below.
> Memory capacities are huge these days (> 60G), and many workflows depend on having large blocks in memory, so it would be good to fix this bug.
> I don't know if the same thing happens on shuffles if one transfer (from mapper to reducer) is > 2G.
> {noformat}
> 14/04/02 02:33:10 ERROR storage.BlockManagerWorker: Exception handling buffer message
> java.lang.ArrayIndexOutOfBoundsException
>         at it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:96)
>         at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:134)
>         at it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:164)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
>         at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:38)
>         at org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:93)
>         at org.apache.spark.serializer.JavaSerializationStream.writeAll(JavaSerializer.scala:26)
>         at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:913)
>         at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:922)
>         at org.apache.spark.storage.MemoryStore.getBytes(MemoryStore.scala:102)
>         at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:348)
>         at org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:323)
>         at org.apache.spark.storage.BlockManagerWorker.getBlock(BlockManagerWorker.scala:90)
>         at org.apache.spark.storage.BlockManagerWorker.processBlockMessage(BlockManagerWorker.scala:69)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$2.apply(BlockManagerWorker.scala:44)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>         at org.apache.spark.storage.BlockMessageArray.foreach(BlockMessageArray.scala:28)
>         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>         at org.apache.spark.storage.BlockMessageArray.map(BlockMessageArray.scala:28)
>         at org.apache.spark.storage.BlockManagerWorker.onBlockMessageReceive(BlockManagerWorker.scala:44)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>         at org.apache.spark.storage.BlockManagerWorker$$anonfun$1.apply(BlockManagerWorker.scala:34)
>         at org.apache.spark.network.ConnectionManager.org$apache$spark$network$ConnectionManager$$handleMessage(ConnectionManager.scala:661)
>         at org.apache.spark.network.ConnectionManager$$anon$9.run(ConnectionManager.scala:503)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}
>
[jira] [Created] (SPARK-1437) Jenkins should build with Java 6
Sean Owen created SPARK-1437:
--------------------------------

             Summary: Jenkins should build with Java 6
                 Key: SPARK-1437
                 URL: https://issues.apache.org/jira/browse/SPARK-1437
             Project: Spark
          Issue Type: Bug
          Components: Build
    Affects Versions: 0.9.0
            Reporter: Sean Owen
            Priority: Minor


Apologies if this was already on someone's to-do list, but I wanted to track this, as it bit two commits in the last few weeks.

Spark is intended to work with Java 6, and so compiles with source/target 1.6. Java 7 can correctly enforce Java 6 language rules and emit Java 6 bytecode. However, unless otherwise configured with -bootclasspath, javac will use its own (Java 7) library classes. This means code that uses classes in Java 7 will be allowed to compile, but the result will fail when run on Java 6.

This is why you get warnings like ...

Using /usr/java/jdk1.7.0_51 as default JAVA_HOME.
...
[warn] warning: [options] bootstrap class path not set in conjunction with -source 1.6

The solution is just to tell Jenkins to use Java 6. This may be stating the obvious, but it should just be a setting under "Configure" for SparkPullRequestBuilder. In our Jenkinses, JDK 6/7/8 are set up; if it's not an option already I'm guessing it's not too hard to get Java 6 configured on the Amplab machines.
[jira] [Resolved] (SPARK-1427) HQL Examples Don't Work
[ https://issues.apache.org/jira/browse/SPARK-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-1427.
-------------------------------------
    Resolution: Fixed

Fixed the toString issue here: https://github.com/apache/spark/pull/343

Could not recreate the permgen problem, but I did run the examples by hand successfully.

> HQL Examples Don't Work
> -----------------------
>
>                 Key: SPARK-1427
>                 URL: https://issues.apache.org/jira/browse/SPARK-1427
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.0.0
>            Reporter: Patrick Wendell
>            Assignee: Michael Armbrust
>             Fix For: 1.0.0
>
> {code}
> scala> hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> 14/04/05 22:40:29 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
> 14/04/05 22:40:30 INFO ParseDriver: Parse Completed
> 14/04/05 22:40:30 INFO Driver:
> 14/04/05 22:40:30 INFO Driver:
> 14/04/05 22:40:30 INFO Driver:
> 14/04/05 22:40:30 INFO Driver:
> 14/04/05 22:40:30 INFO ParseDriver: Parsing command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
> 14/04/05 22:40:30 INFO ParseDriver: Parse Completed
> 14/04/05 22:40:30 INFO Driver: end=1396762830163 duration=1>
> 14/04/05 22:40:30 INFO Driver:
> 14/04/05 22:40:30 INFO SemanticAnalyzer: Starting Semantic Analysis
> 14/04/05 22:40:30 INFO SemanticAnalyzer: Creating table src position=27
> 14/04/05 22:40:30 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 14/04/05 22:40:30 INFO ObjectStore: ObjectStore, initialize called
> 14/04/05 22:40:30 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
> 14/04/05 22:40:30 WARN BoneCPConfig: Max Connections < 1. Setting to 20
> 14/04/05 22:40:32 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 14/04/05 22:40:32 INFO ObjectStore: Initialized ObjectStore
> 14/04/05 22:40:33 WARN BoneCPConfig: Max Connections < 1. Setting to 20
> 14/04/05 22:40:33 INFO HiveMetaStore: 0: get_table : db=default tbl=src
> 14/04/05 22:40:33 INFO audit: ugi=patrick ip=unknown-ip-addr cmd=get_table : db=default tbl=src
> 14/04/05 22:40:33 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
> 14/04/05 22:40:33 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
> 14/04/05 22:40:34 INFO Driver: Semantic Analysis Completed
> 14/04/05 22:40:34 INFO Driver: start=1396762830163 end=1396762834001 duration=3838>
> 14/04/05 22:40:34 INFO Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
> 14/04/05 22:40:34 INFO Driver: end=1396762834006 duration=3860>
> 14/04/05 22:40:34 INFO Driver:
> 14/04/05 22:40:34 INFO Driver: Starting command: CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
> 14/04/05 22:40:34 INFO Driver: start=1396762830146 end=1396762834016 duration=3870>
> 14/04/05 22:40:34 INFO Driver:
> 14/04/05 22:40:34 INFO Driver: end=1396762834016 duration=0>
> 14/04/05 22:40:34 INFO Driver: start=1396762834006 end=1396762834017 duration=11>
> 14/04/05 22:40:34 INFO Driver: OK
> 14/04/05 22:40:34 INFO Driver:
> 14/04/05 22:40:34 INFO Driver: start=1396762834019 end=1396762834019 duration=0>
> 14/04/05 22:40:34 INFO Driver: start=1396762830146 end=1396762834019 duration=3873>
> 14/04/05 22:40:34 INFO Driver:
> 14/04/05 22:40:34 INFO Driver: start=1396762834019 end=1396762834020 duration=1>
> java.lang.AssertionError: assertion failed: No plan for NativeCommand CREATE TABLE IF NOT EXISTS src (key INT, value STRING)
>         at scala.Predef$.assert(Predef.scala:179)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
>         at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:218)
>         at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:218)
>         at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:219)
>         at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:219)
>         at org.apache.spark.sql.SchemaRDDLike$class.toString(SchemaRDDLike.scala:44)
>         at org.apache.spark.sql.SchemaRDD.toString(SchemaRDD.scala:93)
>         at java.lang.String.valueOf(String.java:2854)
>         at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:331)
>         at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
>         at .(:10)
>         at .()
>         at $print()
>
[jira] [Updated] (SPARK-1099) Allow inferring number of cores with local[*]
[ https://issues.apache.org/jira/browse/SPARK-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1099: --- Summary: Allow inferring number of cores with local[*] (was: Spark's local mode should respect spark.cores.max by default) > Allow inferring number of cores with local[*] > - > > Key: SPARK-1099 > URL: https://issues.apache.org/jira/browse/SPARK-1099 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Minor > Fix For: 1.0.0 > > > It seems reasonable that the default number of cores used by spark's local > mode (when no value is specified) is drawn from the spark.cores.max > configuration parameter (which, conveniently, is now settable as a > command-line option in spark-shell). > For the sake of consistency, it's probable that this change would also entail > making the default number of cores when spark.cores.max is NOT specified to > be as many logical cores are on the machine (which is what standalone mode > does). This too seems reasonable, as Spark is inherently a distributed system > and I think it's expected that it should use multiple cores by default. > However, it is a behavioral change, and thus requires caution. -- This message was sent by Atlassian JIRA (v6.2#6252)
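For illustration, here is a sketch of what "inferring the number of cores with local[*]" amounts to. This is not Spark's actual master-URL parser; the class and method names are invented, but the resolution rule matches the summary: `local` means one core, `local[N]` means N, and `local[*]` means as many logical cores as the machine has.

```java
// Hypothetical sketch of resolving a local master URL to a core count.
public class LocalMaster {
    public static int resolveCores(String master) {
        if (master.equals("local")) return 1;
        if (master.equals("local[*]")) {
            // Infer the machine's logical core count, as standalone mode does.
            return Runtime.getRuntime().availableProcessors();
        }
        if (master.startsWith("local[") && master.endsWith("]")) {
            return Integer.parseInt(master.substring(6, master.length() - 1));
        }
        throw new IllegalArgumentException("Not a local master URL: " + master);
    }

    public static void main(String[] args) {
        System.out.println(resolveCores("local[*]"));
    }
}
```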
[jira] [Updated] (SPARK-1099) Spark's local mode should respect spark.cores.max by default
[ https://issues.apache.org/jira/browse/SPARK-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1099: --- Summary: Spark's local mode should respect spark.cores.max by default (was: Spark's local mode should probably respect spark.cores.max by default) > Spark's local mode should respect spark.cores.max by default > > > Key: SPARK-1099 > URL: https://issues.apache.org/jira/browse/SPARK-1099 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Minor > Fix For: 1.0.0 > > > It seems reasonable that the default number of cores used by spark's local > mode (when no value is specified) is drawn from the spark.cores.max > configuration parameter (which, conveniently, is now settable as a > command-line option in spark-shell). > For the sake of consistency, it's probable that this change would also entail > making the default number of cores when spark.cores.max is NOT specified to > be as many logical cores are on the machine (which is what standalone mode > does). This too seems reasonable, as Spark is inherently a distributed system > and I think it's expected that it should use multiple cores by default. > However, it is a behavioral change, and thus requires caution. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1035) Use a single mechanism for distributing jars on Yarn
[ https://issues.apache.org/jira/browse/SPARK-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-1035. --- Resolution: Won't Fix When I originally filed this, I didn't realize that jars could be added at runtime. In light of this, I don't think we can do much better than the current state of things. > Use a single mechanism for distributing jars on Yarn > > > Key: SPARK-1035 > URL: https://issues.apache.org/jira/browse/SPARK-1035 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Pérez González > > When running Spark on Yarn, the app jar is distributed through a different > mechanism than additional added jars. The app jar gets to every worker node > as a Yarn local resource. Additional jars only get to the app master, and the > app master serves them to workers with the HTTP file server. Strangeness > comes when an application addJar's the app jar, which is a natural thing to > do in mesos or standalone mode, but in Yarn mode, will try to distribute the > same jar through a different mechanism. Using the same mechanism for both > would eliminate this issue, as well as greatly simplify debugging > ClassNotFoundExceptions in workers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1059) Now that we submit core requests to YARN, fix usage message in ClientArguments
[ https://issues.apache.org/jira/browse/SPARK-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-1059. --- Resolution: Duplicate This got fixed in Tom's security patch. > Now that we submit core requests to YARN, fix usage message in ClientArguments > -- > > Key: SPARK-1059 > URL: https://issues.apache.org/jira/browse/SPARK-1059 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Sandy Pérez González >Priority: Minor > > "Number of cores for the workers (Default: 1). This is unsused right now." -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1209) SparkHadoopUtil should not use package org.apache.hadoop
[ https://issues.apache.org/jira/browse/SPARK-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza resolved SPARK-1209. --- Resolution: Fixed > SparkHadoopUtil should not use package org.apache.hadoop > > > Key: SPARK-1209 > URL: https://issues.apache.org/jira/browse/SPARK-1209 > Project: Spark > Issue Type: Bug >Affects Versions: 0.9.0 >Reporter: Sandy Pérez González >Assignee: Mark Grover > > It's private, so the change won't break compatibility -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1101) Umbrella for hardening Spark on YARN
[ https://issues.apache.org/jira/browse/SPARK-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza reassigned SPARK-1101: - Assignee: Sandy Ryza (was: Sandy Pérez González) > Umbrella for hardening Spark on YARN > > > Key: SPARK-1101 > URL: https://issues.apache.org/jira/browse/SPARK-1101 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Pérez González >Assignee: Sandy Ryza > > This is an umbrella JIRA to track near-term improvements for Spark on YARN. > I don't think huge changes are required - just fixing some bugs, plugging > usability gaps, and enhancing documentation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1409) Flaky Test: "actor input stream" test in org.apache.spark.streaming.InputStreamsSuite
[ https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962171#comment-13962171 ] Patrick Wendell commented on SPARK-1409: I've disabled this test for now. > Flaky Test: "actor input stream" test in > org.apache.spark.streaming.InputStreamsSuite > - > > Key: SPARK-1409 > URL: https://issues.apache.org/jira/browse/SPARK-1409 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das > > Here are just a few cases: > https://travis-ci.org/apache/spark/jobs/22151827 > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1436) Compression code broke in-memory store
[ https://issues.apache.org/jira/browse/SPARK-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962155#comment-13962155 ] Reynold Xin commented on SPARK-1436: Try running the following code: {code}
package org.apache.spark.sql

import org.apache.spark.sql.test.TestSQLContext._
import org.apache.spark.sql.catalyst.util._

case class Data(a: Int, b: Long)

object AggregationBenchmark {
  def main(args: Array[String]): Unit = {
    val rdd = sparkContext.parallelize(1 to 20).flatMap(_ => (1 to 50).map(i => Data(i % 100, i)))
    rdd.registerAsTable("data")
    cacheTable("data")
    (1 to 10).foreach { i =>
      println(s"=== ITERATION $i ===")
      benchmark {
        println("SELECT COUNT() FROM data:" + sql("SELECT COUNT(*) FROM data").collect().head)
      }
      println("SELECT a, SUM(b) FROM data GROUP BY a")
      benchmark { sql("SELECT a, SUM(b) FROM data GROUP BY a").count() }
      println("SELECT SUM(b) FROM data")
      benchmark { sql("SELECT SUM(b) FROM data").count() }
    }
  }
}
{code} The following exception is thrown: {code} java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.sql.columnar.ColumnAccessor$.apply(ColumnAccessor.scala:103) at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1$$anon$1$$anonfun$3.apply(InMemoryColumnarTableScan.scala:61) at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1$$anon$1$$anonfun$3.apply(InMemoryColumnarTableScan.scala:61) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at
org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1$$anon$1.(InMemoryColumnarTableScan.scala:61) at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1.apply(InMemoryColumnarTableScan.scala:60) at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1.apply(InMemoryColumnarTableScan.scala:56) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:504) at org.apache.spark.rdd.RDD$$anonfun$3.apply(RDD.scala:504) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229) at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:52) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:46) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) 14/04/07 12:07:38 WARN TaskSetManager: Lost TID 3 (task 4.0:0) 14/04/07 12:07:38 WARN TaskSetManager: Loss was due to java.nio.BufferUnderflowException java.nio.BufferUnderflowException at java.nio.Buffer.nextGetIndex(Buffer.java:498) at 
java.nio.HeapByteBuffer.getInt(HeapByteBuffer.java:355) at org.apache.spark.sql.columnar.ColumnAccessor$.apply(ColumnAccessor.scala:103) at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1$$anon$1$$anonfun$3.apply(InMemoryColumnarTableScan.scala:61) at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$execute$1$$anon$1$$anonfun$3.apply(InMemoryColumnarTableScan.scala:61) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(
[jira] [Created] (SPARK-1436) Compression code broke in-memory store
Reynold Xin created SPARK-1436: -- Summary: Compression code broke in-memory store Key: SPARK-1436 URL: https://issues.apache.org/jira/browse/SPARK-1436 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Cheng Lian Priority: Blocker Fix For: 1.0.0 See my following comment... -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962133#comment-13962133 ] Sean Owen commented on SPARK-1406: -- PMML is the de facto serialization, so certainly the one to consider leveraging. It's just a serialization, so it's not by itself going to help with feature transformation. Given data and PMML, it's fairly easy to use things like JPMML to do evaluation. You could write some thin wrapper code in MLlib to facilitate that, but it may not give a lot of marginal benefit. Import/export is a bit different. Again, JPMML handles all the mechanics of serializing an object model, so that need not be written. I think export is more important than import, mostly because I think of MLlib as a model builder, and therefore a producer rather than consumer of models. Export is also easier since you just need to write the glue code to translate some MLlib object into a JPMML representation, and only need to worry about dealing with the subset of PMML that covers whatever the MLlib output describes. Import is harder for the same reason -- you're not going to want to or be able to support everything PMML can describe, so it's already a question of trying to map the vocab as best you can to whatever MLlib supports. It's also less important, IMHO, since MLlib's value is more in making the model than doing something with it right now. I would suggest the import/export stuff be kept close to, but separate from, the other MLlib code. Not a different module, just cleanly separated from the abstract representation. I think there's a whole project's worth of stuff one could do around consuming, managing, serving models! So to summarize: I'd suggest scoping this to start as "wire up all *Model files to JPMML equivalents, as an 'export' package" or something. 
> PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
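To make the "export glue code" direction concrete, here is a hypothetical sketch of turning a linear model's intercept and weights into a minimal PMML-style RegressionTable fragment. A real exporter would go through JPMML's object model rather than hand-assembling XML, and the field names (`x0`, `x1`, ...) are invented for illustration.

```java
// Illustrative only: a hand-rolled PMML RegressionTable fragment for a
// linear model with the given intercept and per-feature coefficients.
public class PmmlExportSketch {
    public static String toPmmlRegressionTable(double intercept, double[] weights) {
        StringBuilder sb = new StringBuilder();
        sb.append("<RegressionTable intercept=\"").append(intercept).append("\">");
        for (int i = 0; i < weights.length; i++) {
            // Feature names are placeholders; a real exporter would use the
            // model's DataDictionary field names.
            sb.append("<NumericPredictor name=\"x").append(i)
              .append("\" coefficient=\"").append(weights[i]).append("\"/>");
        }
        sb.append("</RegressionTable>");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPmmlRegressionTable(0.5, new double[]{1.0, 2.0}));
    }
}
```

As Sean notes, the export direction only has to cover the subset of PMML that MLlib's model types can produce, which keeps sketches like this small.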
[jira] [Updated] (SPARK-1390) Refactor RDD backed matrices
[ https://issues.apache.org/jira/browse/SPARK-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1390: - Fix Version/s: 1.0.0 > Refactor RDD backed matrices > > > Key: SPARK-1390 > URL: https://issues.apache.org/jira/browse/SPARK-1390 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Blocker > Fix For: 1.0.0 > > > The current interfaces of RDD backed matrices need refactoring for v1.0 > release. It would be better if we have a clear separation of local matrices > and those backed by RDD. Right now, we have > 1. org.apache.spark.mllib.linalg.SparseMatrix, which is a wrapper over an RDD > of matrix entries, i.e., coordinate list format. > 2. org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix, which is a wrapper > over RDD[Array[Double]], i.e. row-oriented format. > We will see naming collision when we introduce local SparseMatrix and the > name TallSkinnyDenseMatrix is not exact if we switch to RDD[Vector] instead > of RDD[Array[Double]]. It would be better to have "RDD" in the type name to > suggest that operations will trigger a job. > The proposed names (all under org.apache.spark.mllib.linalg.rdd): > 1. RDDMatrix: trait for matrices backed by one or more RDDs > 2. CoordinateRDDMatrix: wrapper of RDD[RDDMatrixEntry] > 3. RowRDDMatrix: wrapper of RDD[Vector] whose rows do not have special > ordering > 4. IndexedRowRDDMatrix: wrapper of RDD[(Long, Vector)] whose rows are > associated with indices > The proposal is subject to change, but it would be nice to make the changes > before v1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1252) On YARN, use container-log4j.properties for executors
[ https://issues.apache.org/jira/browse/SPARK-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1252. -- Resolution: Fixed > On YARN, use container-log4j.properties for executors > - > > Key: SPARK-1252 > URL: https://issues.apache.org/jira/browse/SPARK-1252 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Critical > Fix For: 1.0.0 > > > YARN provides a log4j.properties file that's distinct from the NodeManager > log4j.properties. Containers are supposed to use this so that they don't try > to write to the NodeManager log file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1252) On YARN, use container-log4j.properties for executors
[ https://issues.apache.org/jira/browse/SPARK-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962109#comment-13962109 ] Thomas Graves commented on SPARK-1252: -- https://github.com/apache/spark/pull/148 > On YARN, use container-log4j.properties for executors > - > > Key: SPARK-1252 > URL: https://issues.apache.org/jira/browse/SPARK-1252 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 0.9.0 >Reporter: Sandy Ryza >Assignee: Sandy Ryza >Priority: Critical > Fix For: 1.0.0 > > > YARN provides a log4j.properties file that's distinct from the NodeManager > log4j.properties. Containers are supposed to use this so that they don't try > to write to the NodeManager log file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1214) 0-1 labels
[ https://issues.apache.org/jira/browse/SPARK-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1214. -- Resolution: Fixed Fix Version/s: 0.9.0 Assignee: Xiangrui Meng (was: Shashidhar E S) Fixed in 0.9.0 or an earlier version. > 0-1 labels > --- > > Key: SPARK-1214 > URL: https://issues.apache.org/jira/browse/SPARK-1214 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Xiangrui Meng > Fix For: 0.9.0 > > > Use \{0,1\} labels for binary classification instead of {-1,1}. Advantages > include: > (+) Consistency across algorithms > (+) Naturally extends to multi-class classification -- This message was sent by Atlassian JIRA (v6.2#6252)
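The convention change itself is a one-line mapping. A sketch (illustrative, not MLlib's code): each {-1, 1} label maps to {0, 1} via an affine shift.

```java
// Illustrative: convert a {-1, 1} binary label to the {0, 1} convention.
public class Labels {
    public static double toZeroOne(double label) {
        return (label + 1.0) / 2.0; // -1 -> 0, 1 -> 1
    }
}
```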
[jira] [Resolved] (SPARK-1217) Add proximal gradient updater.
[ https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1217. -- Resolution: Fixed Fix Version/s: 0.9.0 > Add proximal gradient updater. > -- > > Key: SPARK-1217 > URL: https://issues.apache.org/jira/browse/SPARK-1217 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Ameet Talwalkar > Fix For: 0.9.0 > > > Add proximal gradient updater, in particular for L1 regularization. -- This message was sent by Atlassian JIRA (v6.2#6252)
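For L1 regularization, the proximal step is componentwise soft-thresholding: prox(w) = sign(w) * max(0, |w| - shrinkage), applied after the gradient step. The sketch below shows only the operator, not the full updater.

```java
// Illustrative: the soft-thresholding operator used by a proximal gradient
// step for L1 regularization. Weights within the shrinkage band collapse to
// zero, which is what produces sparse solutions.
public class SoftThreshold {
    public static double apply(double w, double shrinkage) {
        return Math.signum(w) * Math.max(0.0, Math.abs(w) - shrinkage);
    }
}
```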
[jira] [Created] (SPARK-1435) Don't assume context class loader is set when creating classes via reflection
Patrick Wendell created SPARK-1435: -- Summary: Don't assume context class loader is set when creating classes via reflection Key: SPARK-1435 URL: https://issues.apache.org/jira/browse/SPARK-1435 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Blocker -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1435) Don't assume context class loader is set when creating classes via reflection
[ https://issues.apache.org/jira/browse/SPARK-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962061#comment-13962061 ] Patrick Wendell commented on SPARK-1435: SPARK-1403 provides a workaround in the case of Mesos, but in general we should just avoid making this assumption. > Don't assume context class loader is set when creating classes via reflection > - > > Key: SPARK-1435 > URL: https://issues.apache.org/jira/browse/SPARK-1435 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Patrick Wendell >Assignee: Patrick Wendell >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1222) Logistic Regression (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1222. -- Resolution: Fixed Fix Version/s: 0.9.0 Implemented in 0.9.0 or an earlier version. > Logistic Regression (+ regularized variants) > > > Key: SPARK-1222 > URL: https://issues.apache.org/jira/browse/SPARK-1222 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > Fix For: 0.9.0 > > > Implement Logistic Regression using the SGD optimization primitives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1223) Linear Regression (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1223. -- Resolution: Fixed Fix Version/s: 0.9.0 Implemented in 0.9.0 or an earlier version. > Linear Regression (+ regularized variants) > -- > > Key: SPARK-1223 > URL: https://issues.apache.org/jira/browse/SPARK-1223 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > Fix For: 0.9.0 > > > Implement Linear regression using the SGD optimization primitives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1219) Minibatch SGD with disjoint partitions
[ https://issues.apache.org/jira/browse/SPARK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1219: - Fix Version/s: 0.9.0 > Minibatch SGD with disjoint partitions > -- > > Key: SPARK-1219 > URL: https://issues.apache.org/jira/browse/SPARK-1219 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar > Fix For: 0.9.0 > > > Takes a gradient function as input. At each iteration, we run stochastic > gradient descent locally on each worker with a fraction (alpha) of the data > points selected randomly and disjointly (i.e., we ensure that we touch all > datapoints after at most 1/alpha iterations). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1218) Minibatch SGD with random sampling
[ https://issues.apache.org/jira/browse/SPARK-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1218. -- Resolution: Fixed Fix Version/s: 0.9.0 Fixed in 0.9.0 or an earlier version. > Minibatch SGD with random sampling > -- > > Key: SPARK-1218 > URL: https://issues.apache.org/jira/browse/SPARK-1218 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > Fix For: 0.9.0 > > > Takes a gradient function as input. At each iteration, we run stochastic > gradient descent locally on each worker with a fraction of the data points > selected randomly and with replacement (i.e., sampled points may overlap > across iterations). -- This message was sent by Atlassian JIRA (v6.2#6252)
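The per-iteration sampling described above can be sketched as follows (illustrative only, not MLlib's implementation): draw ceil(fraction * n) points uniformly with replacement, so a point may appear more than once within and across iterations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative: sample a fraction of local data points with replacement,
// as one minibatch for an SGD iteration.
public class MiniBatchSample {
    public static List<Double> sample(double[] data, double fraction, long seed) {
        Random rng = new Random(seed);
        int n = (int) Math.ceil(data.length * fraction);
        List<Double> batch = new ArrayList<Double>(n);
        for (int i = 0; i < n; i++) {
            batch.add(data[rng.nextInt(data.length)]); // with replacement
        }
        return batch;
    }
}
```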
[jira] [Resolved] (SPARK-1221) SVMs (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1221. -- Resolution: Fixed Fix Version/s: 0.9.0 Implemented in 0.9.0 or an earlier version. > SVMs (+ regularized variants) > - > > Key: SPARK-1221 > URL: https://issues.apache.org/jira/browse/SPARK-1221 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > Fix For: 0.9.0 > > > Implement Support Vector Machines using the SGD optimization primitives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1219) Minibatch SGD with disjoint partitions
[ https://issues.apache.org/jira/browse/SPARK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1219. -- Resolution: Fixed Implemented in 0.9.0 or an earlier version. > Minibatch SGD with disjoint partitions > -- > > Key: SPARK-1219 > URL: https://issues.apache.org/jira/browse/SPARK-1219 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar > > Takes a gradient function as input. At each iteration, we run stochastic > gradient descent locally on each worker with a fraction (alpha) of the data > points selected randomly and disjointly (i.e., we ensure that we touch all > datapoints after at most 1/alpha iterations). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962048#comment-13962048 ] Xiangrui Meng commented on SPARK-1406: -- I think we should support PMML import/export in MLlib. PMML also provides feature transformations, for which MLlib has very limited support at this time. The questions are 1) how we can leverage existing PMML packages, and 2) how many people will volunteer. Sean, it would be super helpful if you could share some experience on Oryx's PMML support, since I'm also not sure about whether this is the right time to start. > PMML model evaluation support via MLib > -- > > Key: SPARK-1406 > URL: https://issues.apache.org/jira/browse/SPARK-1406 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Thomas Darimont > > It would be useful if spark would provide support the evaluation of PMML > models (http://www.dmg.org/v4-2/GeneralStructure.html). > This would allow to use analytical models that were created with a > statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which > would perform the actual model evaluation for a given input tuple. The PMML > model would then just contain the "parameterization" of an analytical model. > Other projects like JPMML-Evaluator do a similar thing. > https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962032#comment-13962032 ] Patrick Wendell commented on SPARK-1403: The underlying issue here is that we've made assumptions in various parts of the codebase that the context classloader is set on a thread. In general, we should relax these assumptions and just fall back to the classloader that loaded Spark. As a workaround, this patch (https://github.com/apache/spark/pull/322/files) manually sets the classloader to the system class loader. > Spark on Mesos does not set Thread's context class loader > - > > Key: SPARK-1403 > URL: https://issues.apache.org/jira/browse/SPARK-1403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: ubuntu 12.04 on vagrant >Reporter: Bharath Bhushan >Priority: Blocker > > I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark > executor on mesos slave throws a java.lang.ClassNotFoundException for > org.apache.spark.serializer.JavaSerializer. > The lengthy discussion is here: > http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.2#6252)
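The fallback described here can be sketched as follows (illustrative, not Spark's actual code): prefer the thread's context class loader, but fall back to the loader that loaded the framework itself when the context loader is unset, as it is on Mesos executor threads.

```java
// Illustrative: pick a usable class loader for reflective class creation.
public class ClassLoaderFallback {
    public static ClassLoader pick() {
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        // Fall back to the loader that loaded this class (i.e., Spark's own
        // loader in the real codebase) rather than assuming ctx is set.
        return ctx != null ? ctx : ClassLoaderFallback.class.getClassLoader();
    }
}
```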
[jira] [Updated] (SPARK-1403) Mesos on Spark does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1403: --- Summary: Mesos on Spark does not set Thread's context class loader (was: java.lang.ClassNotFoundException - spark on mesos) > Mesos on Spark does not set Thread's context class loader > - > > Key: SPARK-1403 > URL: https://issues.apache.org/jira/browse/SPARK-1403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: ubuntu 12.04 on vagrant >Reporter: Bharath Bhushan >Priority: Blocker > > I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark > executor on mesos slave throws a java.lang.ClassNotFoundException for > org.apache.spark.serializer.JavaSerializer. > The lengthy discussion is here: > http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1403: --- Summary: Spark on Mesos does not set Thread's context class loader (was: Mesos on Spark does not set Thread's context class loader) > Spark on Mesos does not set Thread's context class loader > - > > Key: SPARK-1403 > URL: https://issues.apache.org/jira/browse/SPARK-1403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 > Environment: ubuntu 12.04 on vagrant >Reporter: Bharath Bhushan >Priority: Blocker > > I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark > executor on mesos slave throws a java.lang.ClassNotFoundException for > org.apache.spark.serializer.JavaSerializer. > The lengthy discussion is here: > http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962026#comment-13962026 ] Matei Zaharia commented on SPARK-1021: -- Note that if we do this, we'll need a similar fix in Python, which may be trickier. > sortByKey() launches a cluster job when it shouldn't > > > Key: SPARK-1021 > URL: https://issues.apache.org/jira/browse/SPARK-1021 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.8.0, 0.9.0 >Reporter: Andrew Ash > Labels: starter > > The sortByKey() method is listed as a transformation, not an action, in the > documentation. But it launches a cluster job regardless. > http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html > Some discussion on the mailing list suggested that this is a problem with the > rdd.count() call inside Partitioner.scala's rangeBounds method. > https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 > Josh Rosen suggests that rangeBounds should be made into a lazy variable: > {quote} > I wonder whether making RangePartitioner.rangeBounds into a lazy val would > fix this > (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). > We'd need to make sure that rangeBounds() is never called before an action > is performed. This could be tricky because it's called in the > RangePartitioner.equals() method. Maybe it's sufficient to just compare the > number of partitions, the ids of the RDDs used to create the > RangePartitioner, and the sort ordering. This still supports the case where > I range-partition one RDD and pass the same partitioner to a different RDD. 
> It breaks support for the case where two range partitioners created on > different RDDs happened to have the same rangeBounds(), but it seems unlikely > that this would really harm performance since it's probably unlikely that the > range partitioners are equal by chance. > {quote} > Can we please make this happen? I'll send a PR on GitHub to start the > discussion and testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
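Josh Rosen's suggestion can be sketched as follows. This is an illustrative model, not Spark's actual `RangePartitioner`: the expensive bounds computation (which would call `rdd.count()`) becomes a `lazy val` so merely constructing the partitioner launches no job, and `equals()` compares cheap identifiers instead of forcing the bounds.

```scala
// Illustrative sketch of the lazy-val idea. `computeBounds` stands in for
// the sampling/count work that launches a cluster job in the real code.
class LazyRangePartitioner(
    val partitions: Int,
    val rddId: Int,
    val ascending: Boolean,
    computeBounds: () => Array[Int]) {

  // Deferred: only evaluated when a record is actually partitioned,
  // i.e. when an action runs -- never at construction time.
  lazy val rangeBounds: Array[Int] = computeBounds()

  override def equals(other: Any): Boolean = other match {
    case p: LazyRangePartitioner =>
      // Compare cheap identifiers rather than rangeBounds, so equals()
      // does not force the lazy val (and thus does not launch a job).
      p.partitions == partitions && p.rddId == rddId && p.ascending == ascending
    case _ => false
  }

  override def hashCode: Int = (partitions, rddId, ascending).hashCode
}
```

As the quote notes, this trades away equality between partitioners that happen to compute identical bounds on different RDDs, which is unlikely to matter in practice.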
[jira] [Updated] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
[ https://issues.apache.org/jira/browse/SPARK-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1432: --- Assignee: Davis Shepherd > Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker > - > > Key: SPARK-1432 > URL: https://issues.apache.org/jira/browse/SPARK-1432 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Davis Shepherd >Assignee: Davis Shepherd > Fix For: 1.0.0, 0.9.2 > > > JobProgressTracker continuously cleans up old metadata as per the > spark.ui.retainedStages configuration parameter. It seems however that not > all metadata maps are being cleaned, in particular stageIdToExecutorSummaries > could grow in an unbounded manner in a long running application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
[ https://issues.apache.org/jira/browse/SPARK-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962022#comment-13962022 ] Patrick Wendell commented on SPARK-1432: https://github.com/apache/spark/pull/338 > Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker > - > > Key: SPARK-1432 > URL: https://issues.apache.org/jira/browse/SPARK-1432 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Davis Shepherd >Assignee: Davis Shepherd > Fix For: 1.0.0, 0.9.2 > > > JobProgressTracker continuously cleans up old metadata as per the > spark.ui.retainedStages configuration parameter. It seems however that not > all metadata maps are being cleaned, in particular stageIdToExecutorSummaries > could grow in an unbounded manner in a long running application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
[ https://issues.apache.org/jira/browse/SPARK-1432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1432. Resolution: Fixed Fix Version/s: 0.9.2 1.0.0 > Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker > - > > Key: SPARK-1432 > URL: https://issues.apache.org/jira/browse/SPARK-1432 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 0.9.0 >Reporter: Davis Shepherd >Assignee: Davis Shepherd > Fix For: 1.0.0, 0.9.2 > > > JobProgressTracker continuously cleans up old metadata as per the > spark.ui.retainedStages configuration parameter. It seems however that not > all metadata maps are being cleaned, in particular stageIdToExecutorSummaries > could grow in an unbounded manner in a long running application. -- This message was sent by Atlassian JIRA (v6.2#6252)
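The retention logic at issue can be sketched like this. It is illustrative only, not the real JobProgressTracker: once more than `spark.ui.retainedStages` stages are tracked, the oldest entries must be dropped from every per-stage map, and the bug was that `stageIdToExecutorSummaries` was missed by that cleanup.

```scala
import scala.collection.mutable

// Sketch of bounded per-stage metadata. LinkedHashMap preserves insertion
// order, so the oldest stages are the first keys.
class StageMetadata(retainedStages: Int) {
  val stageIdToTime = mutable.LinkedHashMap[Int, Long]()
  val stageIdToExecutorSummaries = mutable.LinkedHashMap[Int, String]()

  def record(stageId: Int): Unit = {
    stageIdToTime(stageId) = System.currentTimeMillis()
    stageIdToExecutorSummaries(stageId) = s"executor summary for stage $stageId"
    if (stageIdToTime.size > retainedStages) {
      val toRemove = stageIdToTime.keys.take(stageIdToTime.size - retainedStages).toList
      toRemove.foreach { id =>
        stageIdToTime.remove(id)
        // The leak: forgetting this map lets it grow without bound in a
        // long-running application. The fix trims it alongside the others.
        stageIdToExecutorSummaries.remove(id)
      }
    }
  }
}
```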
[jira] [Created] (SPARK-1434) Make labelParser Java friendly.
Xiangrui Meng created SPARK-1434: Summary: Make labelParser Java friendly. Key: SPARK-1434 URL: https://issues.apache.org/jira/browse/SPARK-1434 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Fix For: 1.0.0 MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it. So I make a trait for LabelParser and provide two implementations: binary and multiclass. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1434) Make labelParser Java friendly.
[ https://issues.apache.org/jira/browse/SPARK-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1434: - Component/s: MLlib > Make labelParser Java friendly. > --- > > Key: SPARK-1434 > URL: https://issues.apache.org/jira/browse/SPARK-1434 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > Fix For: 1.0.0 > > > MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java > users won't like it. So I make a trait for LabelParser and provide two > implementations: binary and multiclass. -- This message was sent by Atlassian JIRA (v6.2#6252)
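The trait-based design the issue describes might look like the following. The trait and object names mirror the issue text, but the parsing rules here are illustrative assumptions, not necessarily what MLlib shipped.

```scala
// A named trait that Java callers can reference directly, instead of
// constructing a Scala anonymous function for the label parser.
trait LabelParser extends Serializable {
  /** Parses a string label into a Double. */
  def parse(labelString: String): Double
}

/** Binary labels: any positive value maps to 1.0, everything else to 0.0. */
object BinaryLabelParser extends LabelParser {
  override def parse(labelString: String): Double =
    if (labelString.toDouble > 0) 1.0 else 0.0
}

/** Multiclass labels: the numeric label value is used as-is. */
object MulticlassLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}
```

From Java, a caller could then pass `BinaryLabelParser` (or `MulticlassLabelParser`) where a label parser is expected, with no Scala function syntax involved.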
[jira] [Commented] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962011#comment-13962011 ] Sandeep Singh commented on SPARK-1433: -- Sorry a typo > Upgrade Mesos dependency to 0.17.0 > -- > > Key: SPARK-1433 > URL: https://issues.apache.org/jira/browse/SPARK-1433 > Project: Spark > Issue Type: Task >Reporter: Sandeep Singh >Priority: Minor > > Mesos 0.14.0 was released 6 months ago. > Upgrade Mesos dependency to 0.17.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-1433: - Description: Mesos 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0 was: HBase 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0 > Upgrade Mesos dependency to 0.17.0 > -- > > Key: SPARK-1433 > URL: https://issues.apache.org/jira/browse/SPARK-1433 > Project: Spark > Issue Type: Task >Reporter: Sandeep Singh >Priority: Minor > > Mesos 0.14.0 was released 6 months ago. > Upgrade Mesos dependency to 0.17.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-1433: - Description: HBase 0.14.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0 was: HBase 0.14.0 was released 6 months ago. Upgrade HBase dependency to 0.17.0 > Upgrade Mesos dependency to 0.17.0 > -- > > Key: SPARK-1433 > URL: https://issues.apache.org/jira/browse/SPARK-1433 > Project: Spark > Issue Type: Task >Reporter: Sandeep Singh >Priority: Minor > > HBase 0.14.0 was released 6 months ago. > Upgrade Mesos dependency to 0.17.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1223) Linear Regression (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962005#comment-13962005 ] Martin Jaggi commented on SPARK-1223: - is resolved, right? > Linear Regression (+ regularized variants) > -- > > Key: SPARK-1223 > URL: https://issues.apache.org/jira/browse/SPARK-1223 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > > Implement Linear regression using the SGD optimization primitives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
[ https://issues.apache.org/jira/browse/SPARK-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962006#comment-13962006 ] Sean Owen commented on SPARK-1433: -- You mean Mesos? > Upgrade Mesos dependency to 0.17.0 > -- > > Key: SPARK-1433 > URL: https://issues.apache.org/jira/browse/SPARK-1433 > Project: Spark > Issue Type: Task >Reporter: Sandeep Singh >Priority: Minor > > HBase 0.14.0 was released 6 months ago. > Upgrade HBase dependency to 0.17.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1221) SVMs (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962008#comment-13962008 ] Martin Jaggi commented on SPARK-1221: - is resolved, right? > SVMs (+ regularized variants) > - > > Key: SPARK-1221 > URL: https://issues.apache.org/jira/browse/SPARK-1221 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > > Implement Support Vector Machines using the SGD optimization primitives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1222) Logistic Regression (+ regularized variants)
[ https://issues.apache.org/jira/browse/SPARK-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962007#comment-13962007 ] Martin Jaggi commented on SPARK-1222: - is resolved, right? > Logistic Regression (+ regularized variants) > > > Key: SPARK-1222 > URL: https://issues.apache.org/jira/browse/SPARK-1222 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ameet Talwalkar >Assignee: Shivaram Venkataraman > > Implement Logistic Regression using the SGD optimization primitives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1217) Add proximal gradient updater.
[ https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13962004#comment-13962004 ] Martin Jaggi commented on SPARK-1217: - The L1 updater is already proximal, as in the current code. Since it has no effect for L2, we could mark the issue as resolved for now. > Add proximal gradient updater. > -- > > Key: SPARK-1217 > URL: https://issues.apache.org/jira/browse/SPARK-1217 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Ameet Talwalkar > > Add proximal gradient updater, in particular for L1 regularization. -- This message was sent by Atlassian JIRA (v6.2#6252)
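For context on the comment above: the proximal operator for L1 regularization is soft-thresholding, which shrinks each weight toward zero by the regularization amount and clips at zero. A minimal standalone sketch of that operator (not MLlib's Updater API):

```scala
// Soft-thresholding: prox of the L1 penalty. `shrinkage` would typically be
// stepSize * regParam in an SGD updater.
def softThreshold(weight: Double, shrinkage: Double): Double =
  math.signum(weight) * math.max(0.0, math.abs(weight) - shrinkage)
```

Applied elementwise after a gradient step, this is exactly the "already proximal" behavior of the L1 updater the comment refers to; for L2 the proximal step reduces to plain multiplicative shrinkage, which is why it has no distinct effect there.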
[jira] [Created] (SPARK-1433) Upgrade Mesos dependency to 0.17.0
Sandeep Singh created SPARK-1433: Summary: Upgrade Mesos dependency to 0.17.0 Key: SPARK-1433 URL: https://issues.apache.org/jira/browse/SPARK-1433 Project: Spark Issue Type: Task Reporter: Sandeep Singh Priority: Minor HBase 0.14.0 was released 6 months ago. Upgrade HBase dependency to 0.17.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1217) Add proximal gradient updater.
[ https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] M J updated SPARK-1217: --- Comment: was deleted (was: The L1 updater is already proximal, as in the current code. Since it has no effect for L2, we could mark the issue as resolved for now.) > Add proximal gradient updater. > -- > > Key: SPARK-1217 > URL: https://issues.apache.org/jira/browse/SPARK-1217 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Ameet Talwalkar > > Add proximal gradient updater, in particular for L1 regularization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1217) Add proximal gradient updater.
[ https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961993#comment-13961993 ] M J commented on SPARK-1217: The L1 updater is already proximal, as in the current code. Since it has no effect for L2, we could mark the issue as resolved for now. > Add proximal gradient updater. > -- > > Key: SPARK-1217 > URL: https://issues.apache.org/jira/browse/SPARK-1217 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Ameet Talwalkar > > Add proximal gradient updater, in particular for L1 regularization. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1432) Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker
Davis Shepherd created SPARK-1432: - Summary: Potential memory leak in stageIdToExecutorSummaries in JobProgressTracker Key: SPARK-1432 URL: https://issues.apache.org/jira/browse/SPARK-1432 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 0.9.0 Reporter: Davis Shepherd JobProgressTracker continuously cleans up old metadata as per the spark.ui.retainedStages configuration parameter. It seems however that not all metadata maps are being cleaned, in particular stageIdToExecutorSummaries could grow in an unbounded manner in a long running application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1420) The maven build error for Spark Catalyst
[ https://issues.apache.org/jira/browse/SPARK-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] witgo closed SPARK-1420. Resolution: Fixed > The maven build error for Spark Catalyst > > > Key: SPARK-1420 > URL: https://issues.apache.org/jira/browse/SPARK-1420 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: witgo > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine
[ https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Singh updated SPARK-1422: - Comment: was deleted (was: It will be similar to ec2 script ?) > Add scripts for launching Spark on Google Compute Engine > > > Key: SPARK-1422 > URL: https://issues.apache.org/jira/browse/SPARK-1422 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1420) The maven build error for Spark Catalyst
[ https://issues.apache.org/jira/browse/SPARK-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] witgo updated SPARK-1420: - Fix Version/s: 1.0.0 > The maven build error for Spark Catalyst > > > Key: SPARK-1420 > URL: https://issues.apache.org/jira/browse/SPARK-1420 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: witgo > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1422) Add scripts for launching Spark on Google Compute Engine
[ https://issues.apache.org/jira/browse/SPARK-1422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961922#comment-13961922 ] Sandeep Singh commented on SPARK-1422: -- It will be similar to ec2 script ? > Add scripts for launching Spark on Google Compute Engine > > > Key: SPARK-1422 > URL: https://issues.apache.org/jira/browse/SPARK-1422 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Matei Zaharia > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1417) Spark on Yarn - spark UI link from resourcemanager is broken
[ https://issues.apache.org/jira/browse/SPARK-1417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13961858#comment-13961858 ] Thomas Graves commented on SPARK-1417: -- https://github.com/apache/spark/pull/344 > Spark on Yarn - spark UI link from resourcemanager is broken > > > Key: SPARK-1417 > URL: https://issues.apache.org/jira/browse/SPARK-1417 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Blocker > > When running spark on yarn in yarn-cluster mode, spark registers a url with > the Yarn ResourceManager to point to the spark UI. This link is now broken. > The link should be something like: <resourcemanager>/proxy/<applicationId> > instead it's coming back as <resourcemanager>/<host of am:port> -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1371) HashAggregate should stream tuples and avoid doing an extra count
[ https://issues.apache.org/jira/browse/SPARK-1371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1371. - Resolution: Fixed > HashAggregate should stream tuples and avoid doing an extra count > - > > Key: SPARK-1371 > URL: https://issues.apache.org/jira/browse/SPARK-1371 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.2#6252)