[jira] [Reopened] (SPARK-6145) ORDER BY fails to resolve nested fields
[ https://issues.apache.org/jira/browse/SPARK-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust reopened SPARK-6145:
-------------------------------------
    Assignee: Michael Armbrust

> ORDER BY fails to resolve nested fields
> ---------------------------------------
>
>                 Key: SPARK-6145
>                 URL: https://issues.apache.org/jira/browse/SPARK-6145
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>            Priority: Critical
>             Fix For: 1.3.0
>
> {code}
> sqlContext.jsonRDD(sc.parallelize(
>   """{"a": {"b": 1}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
>
> // Works
> sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY c")
> // Fails now
> sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY a.b")
> // Fails now
> sqlContext.sql("SELECT a.b FROM nestedOrder ORDER BY a.b")
> {code}
>
> Relatedly, the error message for bad get-field operations should also
> include the name of the field in question.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6315) SparkSQL 1.3.0 (RC3) fails to read parquet file generated by 1.1.1
Michael Armbrust created SPARK-6315:
------------------------------------

             Summary: SparkSQL 1.3.0 (RC3) fails to read parquet file generated by 1.1.1
                 Key: SPARK-6315
                 URL: https://issues.apache.org/jira/browse/SPARK-6315
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Michael Armbrust
            Assignee: Cheng Lian
            Priority: Blocker

Parquet files generated by Spark 1.1 have a deprecated representation of the schema. In Spark 1.3 we fail to read these files through the new Parquet code path. We should continue to read these files until we formally deprecate this representation.

As a workaround:

{code}
SET spark.sql.parquet.useDataSourceApi=false
{code}
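For anyone hitting this from a spark-shell session, the workaround is a single SQL statement. A minimal sketch (the read path is hypothetical, and `sqlContext` is the usual shell-provided SQLContext):

```scala
// Workaround from the report: disable the new data source API code path so
// Parquet files written by Spark 1.1 can still be read.
// Run inside spark-shell, where sqlContext is already defined.
sqlContext.sql("SET spark.sql.parquet.useDataSourceApi=false")

// Hypothetical path to a file written by Spark 1.1.1:
val df = sqlContext.parquetFile("/data/written-by-spark-1.1")
```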
[jira] [Commented] (SPARK-6279) Missing interpolation flag "s" on a logging string
[ https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360013#comment-14360013 ]

zzc commented on SPARK-6279:
----------------------------

[~srowen], I am new to Spark and JIRA; sorry for this.

> Missing interpolation flag "s" on a logging string
> --------------------------------------------------
>
>                 Key: SPARK-6279
>                 URL: https://issues.apache.org/jira/browse/SPARK-6279
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: zzc
>            Assignee: zzc
>            Priority: Trivial
>             Fix For: 1.4.0
>
> In KafkaRDD.scala, the interpolation flag "s" is missing from a logging
> string. The log file prints the literal text `Beginning offset
> ${part.fromOffset} is the same as ending offset` instead of the interpolated
> value, e.g. `Beginning offset 111 is the same as ending offset`.
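For reference, the difference the missing `s` flag makes can be shown in a few lines of plain Scala. The `OffsetRange` class below is a stand-in for the real KafkaRDD partition type, not the actual Spark class:

```scala
object InterpolationDemo extends App {
  // Stand-in for the KafkaRDD partition carrying the offsets.
  case class OffsetRange(fromOffset: Long, untilOffset: Long)
  val part = OffsetRange(111L, 111L)

  // Without the `s` flag, Scala treats ${...} as literal text:
  val broken = "Beginning offset ${part.fromOffset} is the same as ending offset"
  // With the `s` flag, the expression is evaluated and substituted:
  val fixed = s"Beginning offset ${part.fromOffset} is the same as ending offset"

  println(broken) // prints the literal placeholder ${part.fromOffset}
  println(fixed)  // prints 111
}
```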
[jira] [Commented] (SPARK-6275) Missing toDF() calls in docs/sql-programming-guide.md
[ https://issues.apache.org/jira/browse/SPARK-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360012#comment-14360012 ]

zzc commented on SPARK-6275:
----------------------------

[~srowen], I am new to Spark and JIRA; sorry for this.

> Missing toDF() calls in docs/sql-programming-guide.md
> -----------------------------------------------------
>
>                 Key: SPARK-6275
>                 URL: https://issues.apache.org/jira/browse/SPARK-6275
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.3.0
>            Reporter: zzc
>            Assignee: zzc
>            Priority: Trivial
>             Fix For: 1.4.0
>
> The toDF() function is missing from the examples in
> docs/sql-programming-guide.md.
[jira] [Created] (SPARK-6314) Failed to load application log data from FileStatus
zzc created SPARK-6314: -- Summary: Failed to load application log data from FileStatus Key: SPARK-6314 URL: https://issues.apache.org/jira/browse/SPARK-6314 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: zzc There are some errors in history server event-log directory while a job is running: {quote} com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName2(ReaderBasedJsonParser.java:1284) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName(ReaderBasedJsonParser.java:1268) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:618) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:43) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:49) at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:260) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$6.apply(FsHistoryProvider.scala:190) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$6.apply(FsHistoryProvider.scala:188) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:188) at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1$$anonfun$run$1.apply$mcV$sp(FsHistoryProvider.scala:94) at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1$$anonfun$run$1.apply(FsHistoryProvider.scala:85) at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1$$anonfun$run$1.apply(FsHistoryProvider.scala:85) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617) at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1.run(FsHistoryProvider.scala:84) {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6313) Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
[ https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6313: --- Priority: Critical (was: Major) > Fetch File Lock file creation doesnt work when Spark working dir is on a NFS > mount > -- > > Key: SPARK-6313 > URL: https://issues.apache.org/jira/browse/SPARK-6313 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Nathan McCarthy >Priority: Critical > > When running in cluster mode and mounting the spark work dir on a NFS volume > (or some volume which doesn't support file locking), the fetchFile (used for > downloading JARs etc on the executors) method in Spark Utils class will fail. > This file locking was introduced as an improvement with SPARK-2713. > See > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415 > > Introduced in 1.2 in commit; > https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 > As this locking is for optimisation for fetching files, could we take a > different approach here to create a temp/advisory lock file? > Typically you would just mount local disks (in say ext4 format) and provide > this as a comma separated list however we are trying to run Spark on MapR. > With MapR we can do a loop back mount to a volume on the local node and take > advantage of MapRs disk pools. This also means we dont need specific mounts > for Spark and improves the generic nature of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6313) Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
[ https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359972#comment-14359972 ] Nathan McCarthy edited comment on SPARK-6313 at 3/13/15 5:38 AM: - Since the {code}val lockFileName = s"${url.hashCode}${timestamp}_lock"{code} uses a timestamp I can't see there being too many problems with hanging/left over lock files. was (Author: nemccarthy): Since the `val lockFileName = s"${url.hashCode}${timestamp}_lock"` uses a timestamp I can't see there being too many problems with hanging/left over lock files. > Fetch File Lock file creation doesnt work when Spark working dir is on a NFS > mount > -- > > Key: SPARK-6313 > URL: https://issues.apache.org/jira/browse/SPARK-6313 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Nathan McCarthy > > When running in cluster mode and mounting the spark work dir on a NFS volume > (or some volume which doesn't support file locking), the fetchFile (used for > downloading JARs etc on the executors) method in Spark Utils class will fail. > This file locking was introduced as an improvement with SPARK-2713. > See > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415 > > Introduced in 1.2 in commit; > https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 > As this locking is for optimisation for fetching files, could we take a > different approach here to create a temp/advisory lock file? > Typically you would just mount local disks (in say ext4 format) and provide > this as a comma separated list however we are trying to run Spark on MapR. > With MapR we can do a loop back mount to a volume on the local node and take > advantage of MapRs disk pools. This also means we dont need specific mounts > for Spark and improves the generic nature of the cluster. 
[jira] [Updated] (SPARK-6313) Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
[ https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nathan McCarthy updated SPARK-6313: --- Affects Version/s: 1.2.0 1.2.1 > Fetch File Lock file creation doesnt work when Spark working dir is on a NFS > mount > -- > > Key: SPARK-6313 > URL: https://issues.apache.org/jira/browse/SPARK-6313 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0, 1.2.1 >Reporter: Nathan McCarthy > > When running in cluster mode and mounting the spark work dir on a NFS volume > (or some volume which doesn't support file locking), the fetchFile (used for > downloading JARs etc on the executors) method in Spark Utils class will fail. > This file locking was introduced as an improvement with SPARK-2713. > See > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415 > > Introduced in 1.2 in commit; > https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 > As this locking is for optimisation for fetching files, could we take a > different approach here to create a temp/advisory lock file? > Typically you would just mount local disks (in say ext4 format) and provide > this as a comma separated list however we are trying to run Spark on MapR. > With MapR we can do a loop back mount to a volume on the local node and take > advantage of MapRs disk pools. This also means we dont need specific mounts > for Spark and improves the generic nature of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6313) Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
[ https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359972#comment-14359972 ] Nathan McCarthy commented on SPARK-6313: Since the `val lockFileName = s"${url.hashCode}${timestamp}_lock"` uses a timestamp I can't see there being too many problems with hanging/left over lock files. > Fetch File Lock file creation doesnt work when Spark working dir is on a NFS > mount > -- > > Key: SPARK-6313 > URL: https://issues.apache.org/jira/browse/SPARK-6313 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Nathan McCarthy > > When running in cluster mode and mounting the spark work dir on a NFS volume > (or some volume which doesn't support file locking), the fetchFile (used for > downloading JARs etc on the executors) method in Spark Utils class will fail. > This file locking was introduced as an improvement with SPARK-2713. > See > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415 > > Introduced in 1.2 in commit; > https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 > As this locking is for optimisation for fetching files, could we take a > different approach here to create a temp/advisory lock file? > Typically you would just mount local disks (in say ext4 format) and provide > this as a comma separated list however we are trying to run Spark on MapR. > With MapR we can do a loop back mount to a volume on the local node and take > advantage of MapRs disk pools. This also means we dont need specific mounts > for Spark and improves the generic nature of the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6313) Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
[ https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359964#comment-14359964 ] Nathan McCarthy commented on SPARK-6313: Suggestion along the lines of; https://github.com/apache/lucene-solr/blob/5314a56924f46522993baf106e6deca0e48a967f/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java or https://github.com/graphhopper/graphhopper/blob/master/core/src/main/java/com/graphhopper/storage/SimpleFSLockFactory.java > Fetch File Lock file creation doesnt work when Spark working dir is on a NFS > mount > -- > > Key: SPARK-6313 > URL: https://issues.apache.org/jira/browse/SPARK-6313 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Nathan McCarthy > > When running in cluster mode and mounting the spark work dir on a NFS volume > (or some volume which doesn't support file locking), the fetchFile (used for > downloading JARs etc on the executors) method in Spark Utils class will fail. > This file locking was introduced as an improvement with SPARK-2713. > See > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415 > > Introduced in 1.2 in commit; > https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 > As this locking is for optimisation for fetching files, could we take a > different approach here to create a temp/advisory lock file? > Typically you would just mount local disks (in say ext4 format) and provide > this as a comma separated list however we are trying to run Spark on MapR. > With MapR we can do a loop back mount to a volume on the local node and take > advantage of MapRs disk pools. This also means we dont need specific mounts > for Spark and improves the generic nature of the cluster. 
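The linked SimpleFSLockFactory implementations both rely on atomic create-if-absent file creation rather than `java.nio` byte-range locks. A minimal sketch of that approach in Scala (class and method names are hypothetical, and whether create-if-absent is truly atomic on a particular NFS implementation is its own caveat):

```scala
import java.io.File

// Hypothetical sketch of the advisory-lock suggestion above.
// File.createNewFile() creates the file only if it does not already exist
// (an atomic existence check per the Javadoc), so the lock does not depend
// on OS-level byte-range locking (java.nio FileLock), which NFS mounts may
// not support.
class SimpleFileLock(lockFile: File) {
  /** Returns true iff this caller created the lock file, i.e. holds the lock. */
  def tryAcquire(): Boolean = lockFile.createNewFile()

  /** Release the lock by deleting the marker file. */
  def release(): Unit = lockFile.delete()
}

object SimpleFileLockDemo extends App {
  val lock = new SimpleFileLock(new File(sys.props("java.io.tmpdir"), "demo_lock"))
  if (lock.tryAcquire()) {
    try {
      // ... fetch the remote file here ...
    } finally lock.release()
  }
}
```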
[jira] [Comment Edited] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359950#comment-14359950 ] Tathagata Das edited comment on SPARK-6222 at 3/13/15 5:09 AM: --- I proposed another way to fix this here https://github.com/apache/spark/pull/5008 Basically, dont clear checkpoint data after the pre-batch-start checkpoint. BTW, super thanks to [~hshreedharan] for painstakingly explaining me offline what the problem was. was (Author: tdas): I proposed another way to fix this here https://github.com/apache/spark/pull/5008 Basically, dont clear checkpoint data after the pre-batch-start checkpoint. BTW, super thanks to [~hshreedharan] for painstakingly explaining me offline what the problem was. I > [STREAMING] All data may not be recovered from WAL when driver is killed > > > Key: SPARK-6222 > URL: https://issues.apache.org/jira/browse/SPARK-6222 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Hari Shreedharan >Priority: Blocker > Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch > > > When testing for our next release, our internal tests written by [~wypoon] > caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs > FlumePolling stream to read data from Flume, then kills the Application > Master. Once YARN restarts it, the test waits until no more data is to be > written and verifies the original against the data on HDFS. This was passing > in 1.2.0, but is failing now. > Since the test ties into Cloudera's internal infrastructure and build > process, it cannot be directly run on an Apache build. But I have been > working on isolating the commit that may have caused the regression. I have > confirmed that it was caused by SPARK-5147 (PR # > [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several > times using the test and the failure is consistently reproducible. 
> To re-confirm, I reverted just this one commit (and Clock consolidation one
> to avoid conflicts), and the issue was no longer reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
> /cc [~tdas], [~pwendell]
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359950#comment-14359950 ]

Tathagata Das commented on SPARK-6222:
--------------------------------------

I proposed another way to fix this here: https://github.com/apache/spark/pull/5008 Basically, don't clear checkpoint data after the pre-batch-start checkpoint. BTW, super thanks to [~hshreedharan] for painstakingly explaining to me offline what the problem was.

> [STREAMING] All data may not be recovered from WAL when driver is killed
> ------------------------------------------------------------------------
>
>                 Key: SPARK-6222
>                 URL: https://issues.apache.org/jira/browse/SPARK-6222
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: Hari Shreedharan
>            Priority: Blocker
>         Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch
>
> When testing for our next release, our internal tests written by [~wypoon]
> caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs
> a FlumePolling stream to read data from Flume, then kills the Application
> Master. Once YARN restarts it, the test waits until no more data is to be
> written and verifies the original against the data on HDFS. This was passing
> in 1.2.0, but is failing now.
> Since the test ties into Cloudera's internal infrastructure and build
> process, it cannot be directly run on an Apache build. But I have been
> working on isolating the commit that may have caused the regression. I have
> confirmed that it was caused by SPARK-5147 (PR #
> [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several
> times using the test and the failure is consistently reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
> /cc [~tdas], [~pwendell]
[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed
[ https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359948#comment-14359948 ] Apache Spark commented on SPARK-6222: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/5008 > [STREAMING] All data may not be recovered from WAL when driver is killed > > > Key: SPARK-6222 > URL: https://issues.apache.org/jira/browse/SPARK-6222 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Hari Shreedharan >Priority: Blocker > Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch > > > When testing for our next release, our internal tests written by [~wypoon] > caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs > FlumePolling stream to read data from Flume, then kills the Application > Master. Once YARN restarts it, the test waits until no more data is to be > written and verifies the original against the data on HDFS. This was passing > in 1.2.0, but is failing now. > Since the test ties into Cloudera's internal infrastructure and build > process, it cannot be directly run on an Apache build. But I have been > working on isolating the commit that may have caused the regression. I have > confirmed that it was caused by SPARK-5147 (PR # > [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several > times using the test and the failure is consistently reproducible. > To re-confirm, I reverted just this one commit (and Clock consolidation one > to avoid conflicts), and the issue was no longer reproducible. > Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0 > /cc [~tdas], [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5376) [Mesos] MesosExecutor should have correct resources
[ https://issues.apache.org/jira/browse/SPARK-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359943#comment-14359943 ]

Lukasz Jastrzebski commented on SPARK-5376:
-------------------------------------------

One comment: if you run multiple Spark applications, then even though executor-id == slave-id, multiple executors can be started on the same host, and every one of them will consume 1 CPU without scheduling any tasks. This can be painful when you want to run multiple streaming applications on Mesos in fine-grained mode, because each streaming driver's executors will consume 1 CPU...

> [Mesos] MesosExecutor should have correct resources
> ---------------------------------------------------
>
>                 Key: SPARK-5376
>                 URL: https://issues.apache.org/jira/browse/SPARK-5376
>             Project: Spark
>          Issue Type: Improvement
>          Components: Mesos
>    Affects Versions: 1.2.0
>            Reporter: Jongyoul Lee
>
> Spark offers task and executor resources. We should fix the resources for
> the executor. As is, it gets the same cores as its tasks and no memory.
[jira] [Created] (SPARK-6313) Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
Nathan McCarthy created SPARK-6313:
-----------------------------------

             Summary: Fetch file lock creation doesn't work when the Spark working dir is on an NFS mount
                 Key: SPARK-6313
                 URL: https://issues.apache.org/jira/browse/SPARK-6313
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.3.0
            Reporter: Nathan McCarthy

When running in cluster mode and mounting the Spark work dir on an NFS volume (or any volume that doesn't support file locking), the fetchFile method in Spark's Utils class (used for downloading JARs etc. on the executors) will fail. This file locking was introduced as an improvement with SPARK-2713.

See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415

Introduced in 1.2 in commit https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696

As this locking is an optimisation for fetching files, could we take a different approach here and create a temp/advisory lock file?

Typically you would just mount local disks (in, say, ext4 format) and provide these as a comma-separated list; however, we are trying to run Spark on MapR. With MapR we can do a loopback mount to a volume on the local node and take advantage of MapR's disk pools. This also means we don't need specific mounts for Spark, which improves the generic nature of the cluster.
[jira] [Resolved] (SPARK-6311) ChiSqTest should check for too few counts
[ https://issues.apache.org/jira/browse/SPARK-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell resolved SPARK-6311.
------------------------------------
    Resolution: Duplicate

> ChiSqTest should check for too few counts
> -----------------------------------------
>
>                 Key: SPARK-6311
>                 URL: https://issues.apache.org/jira/browse/SPARK-6311
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> ChiSqTest assumes that elements of the contingency matrix are large enough
> (have enough counts) s.t. the central limit theorem kicks in. It would be
> reasonable to do one or more of the following:
> * Add a note in the docs about making sure there are a reasonable number of
>   instances being used (or counts in the contingency table entries, to be
>   more precise and account for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is
>    insignificant
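The proposed check could be sketched as follows; the object name and the threshold of 5 are illustrative (5 is the usual rule of thumb for when the chi-squared approximation is considered trustworthy), not part of any actual MLlib API:

```scala
// Hedged sketch of the check proposed in the issue: warn when any expected
// cell count in the contingency table falls below a rule-of-thumb threshold,
// since the chi-squared p-value relies on a central-limit-theorem
// approximation that breaks down for small counts.
object ChiSqCountCheck {
  val MinExpectedCount = 5.0

  /** True if any expected count (rowTotal * colTotal / grandTotal, the
    * expected cell value under independence) is below the threshold. */
  def tooFewCounts(table: Array[Array[Double]]): Boolean = {
    val rowTotals = table.map(_.sum)
    val colTotals = table.transpose.map(_.sum)
    val total = rowTotals.sum
    rowTotals.exists(r => colTotals.exists(c => r * c / total < MinExpectedCount))
  }
}
```

A caller would log a warning (or adjust the p-value, per the second proposal) whenever `tooFewCounts` returns true.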
[jira] [Resolved] (SPARK-6310) ChiSqTest should check for too few counts
[ https://issues.apache.org/jira/browse/SPARK-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-6310. Resolution: Duplicate > ChiSqTest should check for too few counts > - > > Key: SPARK-6310 > URL: https://issues.apache.org/jira/browse/SPARK-6310 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > ChiSqTest assumes that elements of the contingency matrix are large enough > (have enough counts) s.t. the central limit theorem kicks in. It would be > reasonable to do one or more of the following: > * Add a note in the docs about making sure there are a reasonable number of > instances being used (or counts in the contingency table entries, to be more > precise and account for skewed category distributions). > * Add a check in the code which could: > ** Log a warning message > ** Alter the p-value to make sure it indicates the test result is > insignificant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model
[ https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359892#comment-14359892 ]

Debasish Das commented on SPARK-3066:
-------------------------------------

We use the non-level-3 BLAS code in our internal flows with datasets of roughly 60M x 3M. Runtime is decent. I am moving to level-3 BLAS for SPARK-4823, and I think the speed will improve further.

> Support recommendAll in matrix factorization model
> --------------------------------------------------
>
>                 Key: SPARK-3066
>                 URL: https://issues.apache.org/jira/browse/SPARK-3066
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Debasish Das
>
> ALS returns a matrix factorization model, which we can use to predict ratings
> for individual queries as well as small batches. In practice, users may want
> to compute top-k recommendations offline for all users. It is very expensive
> but a common problem. We can do some optimization like:
> 1) collect one side (either user or product) and broadcast it as a matrix
> 2) use level-3 BLAS to compute inner products
> 3) use Utils.takeOrdered to find top-k
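The three-step outline in the issue description can be sketched in plain Scala. The nested loops below stand in for the broadcast and the level-3 BLAS gemm call (a real implementation would block the user rows and call gemm per block); all names are hypothetical:

```scala
// Illustrative sketch of offline top-k recommendation from ALS factors.
// userFactors and productFactors are the two factor matrices, one
// rank-length array per user/product.
object RecommendAllSketch {
  def topK(userFactors: Array[Array[Double]],
           productFactors: Array[Array[Double]], // step 1: broadcast in a real job
           k: Int): Array[Array[Int]] =
    userFactors.map { u =>
      // Step 2: inner products of one user row against every product row
      // (the level-3 BLAS version does this as a gemm over user blocks).
      val scores = productFactors.map(p => u.zip(p).map { case (a, b) => a * b }.sum)
      // Step 3: keep the k highest-scoring product indices
      // (takeOrdered over a bounded priority queue in the real code path).
      scores.zipWithIndex.sortBy(-_._1).take(k).map(_._2)
    }
}
```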
[jira] [Commented] (SPARK-6299) ClassNotFoundException when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359789#comment-14359789 ] Kevin (Sangwoo) Kim commented on SPARK-6299: Hi Sean, Surely it should work, I guess this is quite common pattern while working with spark shell. (This code works in Spark 1.1.1) > ClassNotFoundException when running groupByKey with class defined in REPL. > -- > > Key: SPARK-6299 > URL: https://issues.apache.org/jira/browse/SPARK-6299 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kevin (Sangwoo) Kim >Priority: Critical > > Anyone can reproduce this issue by the code below > (runs well in local mode, got exception with clusters) > (it runs well in Spark 1.1.1) > case class ClassA(value: String) > val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) )) > rdd.groupByKey.collect > org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 > in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage > 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): > java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:274) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) > at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at sca
[jira] [Created] (SPARK-6312) ChiSqTest should check for too few counts
Joseph K. Bradley created SPARK-6312: Summary: ChiSqTest should check for too few counts Key: SPARK-6312 URL: https://issues.apache.org/jira/browse/SPARK-6312 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley ChiSqTest assumes that elements of the contingency matrix are large enough (have enough counts) such that the central limit theorem applies. It would be reasonable to do one or more of the following: * Add a note in the docs about making sure there are a reasonable number of instances being used (or counts in the contingency table entries, to be more precise and account for skewed category distributions). * Add a check in the code which could: ** Log a warning message ** Alter the p-value to make sure it indicates the test result is insignificant -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
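The proposed check can be sketched in a few lines of plain Python (this is not MLlib's API; the threshold of 5 expected counts per cell is a common rule of thumb, not something the issue specifies):

```python
import warnings

MIN_EXPECTED_COUNT = 5.0  # assumed threshold; the issue does not fix a value

def check_contingency_counts(table):
    """Warn if any expected cell count under independence is too small
    for the chi-squared approximation to be trustworthy.

    `table` is a list of rows of observed counts; returns the list of
    (row, col, expected) triples that fall below the threshold."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    too_small = []
    for i, r in enumerate(row_sums):
        for j, c in enumerate(col_sums):
            expected = r * c / total  # expected count under independence
            if expected < MIN_EXPECTED_COUNT:
                too_small.append((i, j, expected))
    if too_small:
        warnings.warn(
            "chi-squared approximation may be invalid: %d cell(s) have "
            "expected count < %g" % (len(too_small), MIN_EXPECTED_COUNT))
    return too_small
```

A test-statistic implementation could log this warning (or inflate the p-value, as the issue suggests) before reporting a result.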
[jira] [Created] (SPARK-6308) VectorUDT is displayed as `vecto` in dtypes
Xiangrui Meng created SPARK-6308: Summary: VectorUDT is displayed as `vecto` in dtypes Key: SPARK-6308 URL: https://issues.apache.org/jira/browse/SPARK-6308 Project: Spark Issue Type: Bug Components: MLlib, SQL Reporter: Xiangrui Meng Assignee: Xiangrui Meng VectorUDT should override simpleString instead of relying on the default implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6309) Add MatrixUDT to support dense/sparse matrices in DataFrames
Xiangrui Meng created SPARK-6309: Summary: Add MatrixUDT to support dense/sparse matrices in DataFrames Key: SPARK-6309 URL: https://issues.apache.org/jira/browse/SPARK-6309 Project: Spark Issue Type: New Feature Components: MLlib, SQL Reporter: Xiangrui Meng This should support both dense and sparse matrices, similar to VectorUDT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6210) Generated column name should not include id of column in it.
[ https://issues.apache.org/jira/browse/SPARK-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359705#comment-14359705 ] Apache Spark commented on SPARK-6210: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5006 > Generated column name should not include id of column in it. > > > Key: SPARK-6210 > URL: https://issues.apache.org/jira/browse/SPARK-6210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > {code} > >>> df.groupBy().max('age').collect() > [Row(MAX(age#0)=5)] > >>> df3.groupBy().max('age', 'height').collect() > [Row(MAX(age#4L)=5, MAX(height#5L)=85)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
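For illustration only (this is not Catalyst's actual code path), the fix amounts to dropping the expression-id suffix (`#0`, `#4L`, ...) when pretty-printing the generated column name:

```python
import re

# Catalyst appends a unique expression id ("age#4L") to attribute names;
# the id leaks into generated aggregate column names like "MAX(age#4L)".
# A fix along these lines would print only the bare attribute name.
_EXPR_ID = re.compile(r"#\d+L?")

def pretty_column_name(generated):
    """Strip Catalyst-style expression ids from a generated column name."""
    return _EXPR_ID.sub("", generated)
```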
[jira] [Assigned] (SPARK-6210) Generated column name should not include id of column in it.
[ https://issues.apache.org/jira/browse/SPARK-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-6210: - Assignee: Davies Liu (was: Michael Armbrust) > Generated column name should not include id of column in it. > > > Key: SPARK-6210 > URL: https://issues.apache.org/jira/browse/SPARK-6210 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > > {code} > >>> df.groupBy().max('age').collect() > [Row(MAX(age#0)=5)] > >>> df3.groupBy().max('age', 'height').collect() > [Row(MAX(age#4L)=5, MAX(height#5L)=85)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359695#comment-14359695 ] Apache Spark commented on SPARK-2426: - User 'debasish83' has created a pull request for this issue: https://github.com/apache/spark/pull/5005 > Quadratic Minimization for MLlib ALS > > > Key: SPARK-2426 > URL: https://issues.apache.org/jira/browse/SPARK-2426 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Debasish Das >Assignee: Debasish Das > Original Estimate: 504h > Remaining Estimate: 504h > > Current ALS supports least squares and nonnegative least squares. > I presented ADMM and IPM based Quadratic Minimization solvers to be used for > the following ALS problems: > 1. ALS with bounds > 2. ALS with L1 regularization > 3. ALS with Equality constraint and bounds > Initial runtime comparisons are presented at Spark Summit. > http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark > Based on Xiangrui's feedback I am currently comparing the ADMM based > Quadratic Minimization solvers with IPM based QpSolvers and the default > ALS/NNLS. I will keep updating the runtime comparison results. > For integration the detailed plan is as follows: > 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization > 2. Integrate QuadraticMinimizer in mllib ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
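As a minimal sketch of problem 1 above (ALS subproblems with bound constraints), a projected-gradient loop solves the same quadratic program the proposed ADMM/IPM solvers target; step size and iteration count here are illustrative, not tuned, and this stands in for — not reproduces — the QuadraticMinimizer design:

```python
def solve_qp_nonneg(A, b, step=0.25, iters=500):
    """Minimize 0.5*x'Ax - b'x subject to x >= 0 by projected gradient.

    A is a symmetric positive-definite matrix (list of rows), b a vector.
    `step` must be below 2 / lambda_max(A) for convergence."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # gradient of 0.5*x'Ax - b'x is A x - b
        grad = [sum(A[i][j] * x[j] for j in range(n)) - b[i]
                for i in range(n)]
        # gradient step followed by projection onto the nonnegative orthant
        x = [max(0.0, x[i] - step * grad[i]) for i in range(n)]
    return x
```

In ALS each least-squares subproblem has exactly this form, with A the Gram matrix of the fixed factor and b the corresponding right-hand side; swapping the projection handles box bounds, and a proximal step handles L1.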
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359665#comment-14359665 ] Nicholas Chammas commented on SPARK-5189: - For the record, this is the script I used to get the launch time stats above: {code} { python -m timeit -r 6 -n 1 \ --setup 'import subprocess; import time; subprocess.call("yes y | ./ec2/spark-ec2 destroy launch-test --identity-file /path/to/file.pem --key-pair my-pair --region us-east-1", shell=True); time.sleep(60)' \ 'subprocess.call("./ec2/spark-ec2 launch launch-test --slaves 99 --identity-file /path/to/file.pem --key-pair my-pair --region us-east-1 --zone us-east-1c --instance-type m3.large", shell=True)' yes y | ./ec2/spark-ec2 destroy launch-test --identity-file /path/to/file.pem --key-pair my-pair --region us-east-1 } {code} > Reorganize EC2 scripts so that nodes can be provisioned independent of Spark > master > --- > > Key: SPARK-5189 > URL: https://issues.apache.org/jira/browse/SPARK-5189 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas > > As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, > then setting up all the slaves together. This includes broadcasting files > from the lonely master to potentially hundreds of slaves. > There are 2 main problems with this approach: > # Broadcasting files from the master to all slaves using > [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] > (e.g. during [ephemeral-hdfs > init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], > or during [Spark > setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) > takes a long time. This time increases as the number of slaves increases. > I did some testing in {{us-east-1}}. 
This is, concretely, what the problem looks like:
> || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
> Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough.
> # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or a slave
> * Remove a node from a cluster
> We need our scripts to roughly be organized to match the above operations. The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are:
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster.
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-5189: Description: As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. This is, concretely, what the problem looks like: || number of slaves ({{m3.large}}) || launch time (best of 6 tries) || | 1 | 8m 44s | | 10 | 13m 45s | | 25 | 22m 50s | | 50 | 37m 30s | | 75 | 51m 30s | | 99 | 1h 5m 30s | Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. 
# Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster. * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script. was: As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. 
Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned no
[jira] [Resolved] (SPARK-4588) Add API for feature attributes
[ https://issues.apache.org/jira/browse/SPARK-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4588. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4925 [https://github.com/apache/spark/pull/4925] > Add API for feature attributes > -- > > Key: SPARK-4588 > URL: https://issues.apache.org/jira/browse/SPARK-4588 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Sean Owen >Priority: Critical > Fix For: 1.4.0 > > > Feature attributes, e.g., continuous/categorical, feature names, feature > dimension, number of categories, number of nonzeros (support) could be useful > for ML algorithms. > In SPARK-3569, we added metadata to schema, which can be used to store > feature attributes along with the dataset. We need to provide a wrapper over > the Metadata class for ML usage. > The design doc is available at > https://docs.google.com/document/d/1796XfSzFbZvGWFs0ky99AJhlqkOBRG1O2bUxK2N4Grk/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7
[ https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359614#comment-14359614 ] Yana Kadiyska commented on SPARK-5389: -- C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where find C:\Windows\System32\find.exe C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where findstr C:\Windows\System32\findstr.exe C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>echo %PATH% C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Enterprise Vault\EVClient\;C:\Program Files (x86)\Git\cmd ;C:\Program Files (x86)\Perforce;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files\Java\jdk1.7.0_40\bin;C:\Program Files (x86)\sbt\\bin;C:\Program Files (x86)\scala\bin; C:\apache-maven-3.1.0\bin;C:\Program Files\Java\jre7\bin\server;"c:\Program Files\R\R-3.0.2"\bin;C:\Python27 > spark-shell.cmd does not run from DOS Windows 7 > --- > > Key: SPARK-5389 > URL: https://issues.apache.org/jira/browse/SPARK-5389 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.2.0 > Environment: Windows 7 >Reporter: Yana Kadiyska > Attachments: SparkShell_Win7.JPG > > > spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. > spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2 > Marking as trivial since calling spark-shell2.cmd also works fine > Attaching a screenshot since the error isn't very useful: > {code} > spark-1.2.0-bin-cdh4>bin\spark-shell.cmd > else was unexpected at this time. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6294) PySpark task may hang while calling take() on it in Java/Scala
[ https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6294. -- Resolution: Fixed Fix Version/s: (was: 1.3.1) (was: 1.4.0) 1.2.2 Issue resolved by pull request 5003 [https://github.com/apache/spark/pull/5003] > PySpark task may hang while call take() on in Java/Scala > > > Key: SPARK-6294 > URL: https://issues.apache.org/jira/browse/SPARK-6294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0, 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.2.2 > > > {code} > >>> rdd = sc.parallelize(range(1<<20)).map(lambda x: str(x)) > >>> rdd._jrdd.first() > {code} > There is the stacktrace while hanging: > {code} > "Executor task launch worker-5" daemon prio=10 tid=0x7f8fd01a9800 > nid=0x566 in Object.wait() [0x7f90481d7000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000630929340> (a > org.apache.spark.api.python.PythonRDD$WriterThread) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x000630929340> (a > org.apache.spark.api.python.PythonRDD$WriterThread) > at java.lang.Thread.join(Thread.java:1355) > at > org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78) > at > org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76) > at > org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) > at 
org.apache.spark.scheduler.Task.run(Task.scala:58) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6268) KMeans parameter getter methods
[ https://issues.apache.org/jira/browse/SPARK-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6268. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4974 [https://github.com/apache/spark/pull/4974] > KMeans parameter getter methods > --- > > Key: SPARK-6268 > URL: https://issues.apache.org/jira/browse/SPARK-6268 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: yuhao yang >Priority: Minor > Fix For: 1.4.0 > > > KMeans has many setters for parameters. It should have matching getters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359524#comment-14359524 ] Reynold Xin commented on SPARK-6190: If I can guarantee, at the block manager level, that all large blocks are chunked into smaller ones of less than 2GB, then there is no reason to support 2GB+ blocks at the block manager level. This affects the very core of Spark. It is important to think about how this will affect the long-term Spark evolution (including explicit memory management, operating directly against records in the form of raw bytes, etc), rather than just rushing in, patching individual problems, and leading to a codebase that has tons of random abstractions. On a separate topic, based on your design doc, LargeByteBuffer is still read-only. There is no interface for LargeByteBufferOutputStream to even write to LargeByteBuffer. Can you include that? > create LargeByteBuffer abstraction for eliminating 2GB limit on blocks > -- > > Key: SPARK-6190 > URL: https://issues.apache.org/jira/browse/SPARK-6190 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Imran Rashid >Assignee: Imran Rashid > Attachments: LargeByteBuffer.pdf > > > A key component in eliminating the 2GB limit on blocks is creating a proper > abstraction for storing more than 2GB. Currently spark is limited by a > reliance on nio ByteBuffer and netty ByteBuf, both of which are limited at > 2GB. This task will introduce the new abstraction and the relevant > implementation and utilities, without affecting the existing implementation > at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
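The chunking idea both sides are discussing can be sketched in a few lines: present many sub-2GB chunks as one logical buffer, so no single underlying buffer ever hits the int-indexed 2GB limit. The class and method names below are illustrative, not the proposed Spark API:

```python
class ChunkedByteBuffer:
    """Hypothetical sketch: a read-only logical buffer backed by a list of
    smaller chunks, each individually below the 2GB ByteBuffer limit."""

    def __init__(self, chunks):
        self.chunks = [bytes(c) for c in chunks]
        self.size = sum(len(c) for c in self.chunks)

    def get(self, position, length):
        """Read `length` bytes starting at the logical `position`,
        crossing chunk boundaries as needed."""
        if position < 0 or position + length > self.size:
            raise IndexError("read out of range")
        out = bytearray()
        for chunk in self.chunks:
            if position >= len(chunk):
                # this chunk lies entirely before the read window
                position -= len(chunk)
                continue
            out.extend(chunk[position:position + (length - len(out))])
            position = 0
            if len(out) == length:
                break
        return bytes(out)
```

The write side Reynold asks about would be the mirror image: an output stream that starts a fresh chunk whenever the current one reaches the chunk-size cap.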
[jira] [Resolved] (SPARK-5622) Add connector/handler hive configuration settings to hive-thrift-server
[ https://issues.apache.org/jira/browse/SPARK-5622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5622. -- Resolution: Won't Fix This sounded more clearly like a WontFix from the PR. > Add connector/handler hive configuration settings to hive-thrift-server > --- > > Key: SPARK-5622 > URL: https://issues.apache.org/jira/browse/SPARK-5622 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0, 1.1.1 >Reporter: Alex Liu > > When integrate Cassandra Storage handler to Spark SQL, we need pass some > configuration settings to Hive-thrift-server hiveConf during server starting > process. > e.g. > {code} > ./sbin/start-thriftserver.sh --hiveconf cassandra.username=cassandra > --hiveconf cassandra.password=cassandra > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6307) Executors fetch the same rdd-block 100s or 1000s of times
Tobias Bertelsen created SPARK-6307: --- Summary: Executors fetch the same rdd-block 100s or 1000s of times Key: SPARK-6307 URL: https://issues.apache.org/jira/browse/SPARK-6307 Project: Spark Issue Type: Bug Affects Versions: 2+ Environment: Linux, Spark Standalone 2.10, running in a PBS grid engine Reporter: Tobias Bertelsen The block manager kept fetching the same blocks over and over, making tasks with network activity extremely slow. Two identical tasks can take anywhere from 12 seconds to more than an hour (where I stopped it). Spark should cache the blocks so it does not fetch the same blocks over and over. Here is a simplified version of the code that provokes it:
{code}
// Read a few thousand lines (~ 15 MB)
val fileContents = sc.newAPIHadoopFile(path, ..).repartition(16)
val data = fileContents.map { x => parseContent(x) }.cache()
// Do a pairwise comparison and count the best pairs
val pairs = data.cartesian(data).filter { case (x, y) => similarity(x, y) > 0.9 }
pairs.count()
{code}
This is a tiny fraction of one of the worker's stderr:
{code}
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_1 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_0 remotely
Thousands more lines, fetching the same 16 remote blocks
15/03/12 22:25:44 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
{code}
h2. Details for that stage from the UI.
- *Total task time across all tasks:* 11.9 h
- *Input:* 2.2 GB
- *Shuffle read:* 4.5 MB
h3. 
Summary Metrics for 176 Completed Tasks
|| Metric || Min || 25th percentile || Median || 75th percentile || Max ||
| Duration | 7 s | 8 s | 8 s | 12 s | 59 min |
| GC Time | 0 ms | 99 ms | 0.1 s | 0.2 s | 0.5 s |
| Input | 6.9 MB | 8.2 MB | 8.4 MB | 9.0 MB | 11.0 MB |
| Shuffle Read (Remote) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 676.6 KB |
h3. Aggregated Metrics by Executor
|| Executor ID || Address || Task Time || Total Tasks || Failed Tasks || Succeeded Tasks || Input || Output || Shuffle Read || Shuffle Write || Shuffle Spill (Memory) || Shuffle Spill (Disk) ||
| 0 | n-62-23-3:49566 | 5.7 h | 9 | 0 | 9 | 171.0 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 1 | n-62-23-6:57518 | 16.4 h | 20 | 0 | 20 | 169.9 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 2 | n-62-18-48:33551 | 0 ms | 0 | 0 | 0 | 169.6 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 3 | n-62-23-5:58421 | 2.9 min | 12 | 0 | 12 | 266.2 MB | 0.0 B | 4.5 MB | 0.0 B | 0.0 B | 0.0 B |
| 4 | n-62-23-1:40096 | 23 min | 164 | 0 | 164 | 1430.4 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
h3. 
Tasks || Index || ID || Attempt || Status || Locality Level || Executor ID / Host || Launch Time || Duration || GC Time || Input || Shuffle Read || Errors || | 1 | 2 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 6.9 MB (memory) | 676.6 KB || | 0 | 1 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.7 MB (network) | 0.0 B || | 4 | 5 | 0 | SUCCESS | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 38 min | 0.4 s | 8.6 MB (network) | 0.0 B || | 3 | 4 | 0 | RUNNING | ANY | 2 / n-62-18-48 | 2015/03/12 21:55:00 | 55 min | | 8.3 MB (network) | 0.0 B || | 2 | 3 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 11 s | 0.3 s | 8.4 MB (memory) | 0.0 B || | 7 | 8 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 12 s | 0.3 s | 9.2 MB (memory) | 0.0 B || | 6 | 7 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 8.1 MB (memory) | 0.0 B || | 5 | 6 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.6 MB (network) | 0.0 B || | 9 | 10 | 0 | RUNNING | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 55 min | | 8.7 MB (network) | 0.0 B || -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
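A back-of-the-envelope model of why this hurts so much (plain Python, outside Spark; it assumes the simplified picture that every output partition of `rdd.cartesian(rdd)` re-reads both of its parent blocks and that nothing is cached between reads, which is the behavior the log excerpt suggests):

```python
# Rough model of block reads in data.cartesian(data) with p input
# partitions: the product has p * p output partitions, each of which
# needs two input blocks.
p = 16  # partitions, as in repartition(16) in the report above

# Without any caching of remotely fetched blocks, every output
# partition re-fetches both parents.
fetches_without_caching = 2 * p * p

# With caching, each of the p distinct blocks needs to be fetched at
# most once per executor that touches it (a lower bound of p fetches).
fetches_with_caching = p

print(fetches_without_caching)  # 512 reads of only 16 distinct blocks
print(fetches_with_caching)     # 16
```

The gap grows quadratically with the partition count, which is consistent with the thousands of repeated "Found block rdd_8_* remotely" lines above.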
[jira] [Updated] (SPARK-5740) Change comment default value from empty string to "null" in DescribeCommand
[ https://issues.apache.org/jira/browse/SPARK-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5740: - Priority: Minor (was: Major) Target Version/s: (was: 1.4.0) Fix Version/s: (was: 1.3.0) Given the PR discussion, is this WontFix? I wasn't 100% sure. > Change comment default value from empty string to "null" in DescribeCommand > --- > > Key: SPARK-5740 > URL: https://issues.apache.org/jira/browse/SPARK-5740 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Li Sheng >Priority: Minor > Original Estimate: 72h > Remaining Estimate: 72h > > Change comment default value from empty string to "null" in DescribeCommand -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359437#comment-14359437 ] Sean Owen commented on SPARK-4927: -- OK, behavior looks a little different on YARN; I find, however, that memory usage stabilizes quickly. For example, with 2 executors / 512M / 1 core each, they show 461 and 467 MB free, +/- 1 MB. With 5 executors, 8 cores, 512MB, about 500 MB is free very consistently over time. Maybe it was resolved at some point? > Spark does not clean up properly during long jobs. > --- > > Key: SPARK-4927 > URL: https://issues.apache.org/jira/browse/SPARK-4927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Ilya Ganelin > > On a long running Spark job, Spark will eventually run out of memory on the > driver node due to metadata overhead from the shuffle operation. Spark will > continue to operate, however with drastically decreased performance (since > swapping now occurs with every operation). > The spark.cleanup.tll parameter allows a user to configure when cleanup > happens but the issue with doing this is that it isn’t done safely, e.g. If > this clears a cached RDD or active task in the middle of processing a stage, > this ultimately causes a KeyNotFoundException when the next stage attempts to > reference the cleared RDD or task. > There should be a sustainable mechanism for cleaning up stale metadata that > allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359404#comment-14359404 ] Nicholas Chammas commented on SPARK-6282: - Shouldn't be related to boto. "_winreg" appears to be something Python uses to access the Windows registry, which is strange. Please give us more details about your cluster setup, where you are running the driver from, etc. Also, what if you try using numpy's implementation of {{random}}? > Strange Python import error when using random() in a lambda function > > > Key: SPARK-6282 > URL: https://issues.apache.org/jira/browse/SPARK-6282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: Kubuntu 14.04, Python 2.7.6 >Reporter: Pavel Laskov >Priority: Minor > > Consider the exemplary Python code below: >from random import random >from pyspark.context import SparkContext >from xval_mllib import read_csv_file_as_list > if __name__ == "__main__": > sc = SparkContext(appName="Random() bug test") > data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) > #data = sc.parallelize([1, 2, 3, 4, 5], 2) > d = data.map(lambda x: (random(), x)) > print d.first() > Data is read from a large CSV file. Running this code results in a Python > import error: > ImportError: No module named _winreg > If I use 'import random' and 'random.random()' in the lambda function no > error occurs. Also no error occurs, for both kinds of import statements, for > a small artificial data set like the one shown in a commented line. > The full error trace, the source code of csv reading code (function > 'read_csv_file_as_list' is my own) as well as a sample dataset (the original > dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
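The workaround the reporter already confirmed (importing the module rather than the function, then calling `random.random()` inside the closure) can be sketched outside Spark as follows; `tag_with_random` is a name invented here for illustration, not part of the reported code:

```python
import random  # module-level import; the reporter saw no error with this form

def tag_with_random(x):
    # Accessing random.random() through the module inside the closure
    # was the variant that ran without the ImportError in the report,
    # unlike `from random import random`.
    return (random.random(), x)

# Stand-in for rdd.map(lambda x: (random(), x)) on a plain Python list.
pairs = [tag_with_random(x) for x in [1, 2, 3, 4, 5]]
print(pairs[0])
```

Each element gets a pseudo-random key in [0, 1); the same pattern drops into `data.map(tag_with_random)` on an RDD.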
[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark
[ https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359366#comment-14359366 ] mike bowles commented on SPARK-1673: Here's a table of scaling results for our implementation of glmnet regression. These were run locally on a 4-core server. The data set is the Higgs boson data set (available on AWS). We measured training times for various numbers of rows of data, from 1000 to 10 million; the attribute space is 28 variables wide. We ran on 1 through 4 cores on the server.

Training times (sec):
|| #rows || 1 core || 2 cores || 3 cores || 4 cores ||
| 100K | 4.88 | 3.79 | 3.41 | 3.48 |
| 1M | 20.5 | 10.6 | 9.51 | 8.45 |
| 5M | 71.2 | 37.1 | 26.7 | 25.5 |
| 10M | 155 | 70.5 | 59.7 | 49.7 |

The structure of the algorithm suggests that the training times should be linear in the number of rows, and the test results bear that out. Two cores show a speedup of ~2x over one core, three cores ~2.6x, and four cores ~3.1x. The four-core result probably lags due to contention with system functions, etc. Running on AWS will make that clearer; that's in process now. Our next steps are:
1. run on some wider data sets
2. run on a larger cluster
3. run OWLQN on the same problems in the same setting
4. experiment with speedups - Joseph Bradley's approximation idea, and cutting the number of data passes down by predicting which variables are going to become active instead of waiting until they do.

> GLMNET implementation in Spark > -- > > Key: SPARK-1673 > URL: https://issues.apache.org/jira/browse/SPARK-1673 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Sung Chung > > This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, > Rob Tibshirani. > http://www.jstatsoft.org/v33/i01/paper > It's a straightforward implementation of the Coordinate-Descent based L1/L2 > regularized linear models, including Linear/Logistic/Multinomial regressions.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
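The speedup figures quoted in the comment above can be reproduced directly from the timing numbers; a quick sketch (times in seconds, copied from the table):

```python
# Training times (seconds) per core count, from the glmnet scaling table.
times = {
    "100K": [4.88, 3.79, 3.41, 3.48],
    "1M":   [20.5, 10.6, 9.51, 8.45],
    "5M":   [71.2, 37.1, 26.7, 25.5],
    "10M":  [155.0, 70.5, 59.7, 49.7],
}

for rows, t in times.items():
    # Speedup over the single-core run for the same row count.
    speedups = [round(t[0] / tn, 2) for tn in t]
    print(rows, speedups)
# The 10M row gives roughly [1.0, 2.2, 2.6, 3.12], consistent with the
# ~2x / ~2.6x / ~3.1x figures quoted in the comment.
```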
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359362#comment-14359362 ] Sean Owen commented on SPARK-6282: -- [~nchammas] or [~shivaram] might have a clue if it distantly relates to boto. > Strange Python import error when using random() in a lambda function > > > Key: SPARK-6282 > URL: https://issues.apache.org/jira/browse/SPARK-6282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: Kubuntu 14.04, Python 2.7.6 >Reporter: Pavel Laskov >Priority: Minor > > Consider the exemplary Python code below: >from random import random >from pyspark.context import SparkContext >from xval_mllib import read_csv_file_as_list > if __name__ == "__main__": > sc = SparkContext(appName="Random() bug test") > data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) > #data = sc.parallelize([1, 2, 3, 4, 5], 2) > d = data.map(lambda x: (random(), x)) > print d.first() > Data is read from a large CSV file. Running this code results in a Python > import error: > ImportError: No module named _winreg > If I use 'import random' and 'random.random()' in the lambda function no > error occurs. Also no error occurs, for both kinds of import statements, for > a small artificial data set like the one shown in a commented line. > The full error trace, the source code of csv reading code (function > 'read_csv_file_as_list' is my own) as well as a sample dataset (the original > dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359336#comment-14359336 ] Joseph K. Bradley commented on SPARK-6282: -- It looks like "winreg" is referenced in Spark's dependencies (specifically, "boto" which is used for ec2). I'm not very familiar with that part, and it's strange to me that it's ML-specific. If others here aren't sure, I'd try asking on the user list. > Strange Python import error when using random() in a lambda function > > > Key: SPARK-6282 > URL: https://issues.apache.org/jira/browse/SPARK-6282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.2.0 > Environment: Kubuntu 14.04, Python 2.7.6 >Reporter: Pavel Laskov >Priority: Minor > > Consider the exemplary Python code below: >from random import random >from pyspark.context import SparkContext >from xval_mllib import read_csv_file_as_list > if __name__ == "__main__": > sc = SparkContext(appName="Random() bug test") > data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) > #data = sc.parallelize([1, 2, 3, 4, 5], 2) > d = data.map(lambda x: (random(), x)) > print d.first() > Data is read from a large CSV file. Running this code results in a Python > import error: > ImportError: No module named _winreg > If I use 'import random' and 'random.random()' in the lambda function no > error occurs. Also no error occurs, for both kinds of import statements, for > a small artificial data set like the one shown in a commented line. > The full error trace, the source code of csv reading code (function > 'read_csv_file_as_list' is my own) as well as a sample dataset (the original > dataset is about 8M large) can be provided. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359294#comment-14359294 ] Ilya Ganelin commented on SPARK-4927: - Are you running on YARN? My theory is that the memory usage has to do with data movement between nodes. > Spark does not clean up properly during long jobs. > --- > > Key: SPARK-4927 > URL: https://issues.apache.org/jira/browse/SPARK-4927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Ilya Ganelin > > On a long running Spark job, Spark will eventually run out of memory on the > driver node due to metadata overhead from the shuffle operation. Spark will > continue to operate, however with drastically decreased performance (since > swapping now occurs with every operation). > The spark.cleanup.tll parameter allows a user to configure when cleanup > happens but the issue with doing this is that it isn’t done safely, e.g. If > this clears a cached RDD or active task in the middle of processing a stage, > this ultimately causes a KeyNotFoundException when the next stage attempts to > reference the cleared RDD or task. > There should be a sustainable mechanism for cleaning up stale metadata that > allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3424) KMeans Plus Plus is too slow
[ https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359241#comment-14359241 ] Xiangrui Meng commented on SPARK-3424: -- Ah, sorry! I typed your email manually in the commit message but I missed "r". The commit message is immutable, so I cannot update it now. I'll be more careful next time. > KMeans Plus Plus is too slow > > > Key: SPARK-3424 > URL: https://issues.apache.org/jira/browse/SPARK-3424 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.0.2 >Reporter: Derrick Burns >Assignee: Derrick Burns > Fix For: 1.3.0 > > > The KMeansPlusPlus algorithm is implemented in time O( m k^2), where m is > the rounds of the KMeansParallel algorithm and k is the number of clusters. > This can be dramatically improved by maintaining the distance to the closest > cluster center from round to round and then incrementally updating that value > for each point. This incremental update is O(1) time; this reduces the > running time for K Means Plus Plus to O( m k ). For large k, this is > significant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
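The incremental update described in the issue (keep each point's distance to its closest chosen center, and refresh it in O(1) per point when a center is added, instead of rescanning all centers every round) can be sketched as below. This is a plain-Python illustration of the idea, not MLlib's Scala implementation; the function and variable names are invented here:

```python
import random

def kmeans_pp_incremental(points, k, dist, seed=0):
    """k-means++ seeding in O(n * k) distance evaluations: maintain
    best_dist[i] across rounds and only compare each point against the
    newest center, rather than recomputing against all chosen centers."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    best_dist = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Sample the next center proportionally to squared distance,
        # the usual k-means++ rule.
        weights = [d * d for d in best_dist]
        new_center = rng.choices(points, weights=weights, k=1)[0]
        centers.append(new_center)
        # O(1) refresh per point: one distance to the new center.
        for i, p in enumerate(points):
            best_dist[i] = min(best_dist[i], dist(p, new_center))
    return centers

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
centers = kmeans_pp_incremental(pts, 2, euclid)
print(centers)
```

Without the cached `best_dist`, each of the m rounds would scan all current centers for every point, giving the O(m k^2)-per-point behavior the issue complains about.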
[jira] [Updated] (SPARK-4001) Add FP-growth algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4001: - Summary: Add FP-growth algorithm to Spark MLlib (was: Add Apriori algorithm to Spark MLlib) > Add FP-growth algorithm to Spark MLlib > -- > > Key: SPARK-4001 > URL: https://issues.apache.org/jira/browse/SPARK-4001 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Jacky Li >Assignee: Jacky Li > Fix For: 1.3.0 > > Attachments: Distributed frequent item mining algorithm based on > Spark.pptx > > > Apriori is the classic algorithm for frequent item set mining in a > transactional data set. It will be useful if Apriori algorithm is added to > MLLib in Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359200#comment-14359200 ] Sean Owen commented on SPARK-4927: -- Yes that's what I'm running in spark-shell (plus imports, and removing that log that didn't compile for some reason). I don't see decreasing memory available. > Spark does not clean up properly during long jobs. > --- > > Key: SPARK-4927 > URL: https://issues.apache.org/jira/browse/SPARK-4927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Ilya Ganelin > > On a long running Spark job, Spark will eventually run out of memory on the > driver node due to metadata overhead from the shuffle operation. Spark will > continue to operate, however with drastically decreased performance (since > swapping now occurs with every operation). > The spark.cleanup.tll parameter allows a user to configure when cleanup > happens but the issue with doing this is that it isn’t done safely, e.g. If > this clears a cached RDD or active task in the middle of processing a stage, > this ultimately causes a KeyNotFoundException when the next stage attempts to > reference the cleared RDD or task. > There should be a sustainable mechanism for cleaning up stale metadata that > allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359166#comment-14359166 ] Patrick Wendell commented on SPARK-5654: I see the decision here as somewhat orthogonal to vendors and vendor packaging. Vendors can choose whether to package this component or not, and some may leave it out until it gets more mature. Of course, they are more encouraged/pressured to package things that end up inside the project itself, but that could be used to justify merging all kinds of random stuff into Spark, so I don't think it's a sufficient justification. The main argument, as I said before, is just that non-JVM language APIs are really not possible to maintain outside of the project, because they're not building on any even remotely "public" API. Imagine if we tried to have PySpark as its own project; it is so tightly coupled that it wouldn't work. I have argued in the past for things to exist outside the project when they can, and I still promote that strongly. > Integrate SparkR into Apache Spark > -- > > Key: SPARK-5654 > URL: https://issues.apache.org/jira/browse/SPARK-5654 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Reporter: Shivaram Venkataraman > > The SparkR project [1] provides a light-weight frontend to launch Spark jobs > from R. The project was started at the AMPLab around a year ago and has been > incubated as its own project to make sure it can be easily merged into > upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s > goals are similar to PySpark and shares a similar design pattern as described > in our meetup talk[2], Spark Summit presentation[3]. > Integrating SparkR into the Apache project will enable R users to use Spark > out of the box and given R’s large user base, it will help the Spark project > reach more users. 
Additionally, work in progress features like providing R > integration with ML Pipelines and Dataframes can be better achieved by > development in a unified code base. > SparkR is available under the Apache 2.0 License and does not have any > external dependencies other than requiring users to have R and Java installed > on their machines. SparkR’s developers come from many organizations > including UC Berkeley, Alteryx, Intel and we will support future development, > maintenance after the integration. > [1] https://github.com/amplab-extras/SparkR-pkg > [2] http://files.meetup.com/3138542/SparkR-meetup.pdf > [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358958#comment-14358958 ] Ilya Ganelin edited comment on SPARK-4927 at 3/12/15 6:50 PM: -- Hi Sean - I have a code snippet that reproduced this. Let me send it to you in a bit - I don't have the means to run 1.3 in a cluster. Realized that I already had that code snippet posted. Running the above code doesn't reproduce the issue? was (Author: ilganeli): Hi Sean - I have a code snippet that reproduced this. Let me send it to you in a bit - I don't have the means to run 1.3 in a cluster. Sent with Good (www.good.com) > Spark does not clean up properly during long jobs. > --- > > Key: SPARK-4927 > URL: https://issues.apache.org/jira/browse/SPARK-4927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Ilya Ganelin > > On a long running Spark job, Spark will eventually run out of memory on the > driver node due to metadata overhead from the shuffle operation. Spark will > continue to operate, however with drastically decreased performance (since > swapping now occurs with every operation). > The spark.cleanup.tll parameter allows a user to configure when cleanup > happens but the issue with doing this is that it isn’t done safely, e.g. If > this clears a cached RDD or active task in the middle of processing a stage, > this ultimately causes a KeyNotFoundException when the next stage attempts to > reference the cleared RDD or task. > There should be a sustainable mechanism for cleaning up stale metadata that > allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4012) Uncaught OOM in ContextCleaner
[ https://issues.apache.org/jira/browse/SPARK-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359062#comment-14359062 ] Apache Spark commented on SPARK-4012: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/5004 > Uncaught OOM in ContextCleaner > -- > > Key: SPARK-4012 > URL: https://issues.apache.org/jira/browse/SPARK-4012 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Nan Zhu >Assignee: Nan Zhu > > When running an "might-be-memory-intensive" application locally, I received > the following exception > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Spark Context Cleaner" > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the > VM may need to be forcibly terminated > Exception: java.lang.OutOfMemoryError thrown from the > UncaughtExceptionHandler in thread "Driver Heartbeater" > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- > the VM may need to be forcibly terminated > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the > VM may need to be forcibly terminated > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the > VM may need to be forcibly terminated > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the > VM may need to be forcibly terminated > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the > VM may need to be forcibly terminated > Java HotSpot(TM) 64-Bit Server VM warning: Exception > java.lang.OutOfMemoryError occurred 
dispatching signal SIGINT to handler- the > VM may need to be forcibly terminated > I looked at the code, we might want to call Utils.tryOrExit instead of > Utils.logUncaughtExceptions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
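The direction suggested at the end of the report (escalate fatal errors so the process fails fast, rather than merely logging them) resembles the following hedged Python sketch. `try_or_exit` is a hypothetical analog of the `Utils.tryOrExit` mentioned above, `MemoryError` stands in for `OutOfMemoryError`, and the `exit_fn` hook stands in for terminating the JVM:

```python
import logging

def try_or_exit(block, exit_fn):
    """Run `block`; log ordinary exceptions, but escalate fatal errors
    (here MemoryError, the Python analog of OutOfMemoryError) through
    `exit_fn` so the process fails fast instead of limping on."""
    try:
        block()
    except MemoryError as fatal:
        logging.error("fatal error, shutting down: %r", fatal)
        exit_fn(1)  # in a real daemon this would terminate the process
    except Exception:
        logging.exception("non-fatal uncaught exception; continuing")

def boom():
    raise MemoryError("simulated OOM in cleaner thread")

calls = []  # record exit requests instead of actually exiting
try_or_exit(boom, calls.append)
print(calls)  # [1] -- the fatal path triggered the exit hook
```

A log-and-swallow handler, by contrast, would leave the thread alive after an OOM, which is the failure mode the report describes.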
[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
[ https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359027#comment-14359027 ] Sean Owen commented on SPARK-1564: -- Yeah that's what I did, just made it not tied to the old 1.0 parent issue. > Add JavaScript into Javadoc to turn ::Experimental:: and such into badges > - > > Key: SPARK-1564 > URL: https://issues.apache.org/jira/browse/SPARK-1564 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Matei Zaharia >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
[ https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359017#comment-14359017 ] Matei Zaharia commented on SPARK-1564: -- This is still a valid issue AFAIK, isn't it? These things still show up badly in Javadoc. So we could change the parent issue or something but I'd like to see it fixed. > Add JavaScript into Javadoc to turn ::Experimental:: and such into badges > - > > Key: SPARK-1564 > URL: https://issues.apache.org/jira/browse/SPARK-1564 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Matei Zaharia >Assignee: Andrew Or >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6294) PySpark task may hang while calling take() in Java/Scala
[ https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359016#comment-14359016 ] Apache Spark commented on SPARK-6294: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5003 > PySpark task may hang while call take() on in Java/Scala > > > Key: SPARK-6294 > URL: https://issues.apache.org/jira/browse/SPARK-6294 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0, 1.2.1 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 1.4.0, 1.3.1 > > > {code} > >>> rdd = sc.parallelize(range(1<<20)).map(lambda x: str(x)) > >>> rdd._jrdd.first() > {code} > There is the stacktrace while hanging: > {code} > "Executor task launch worker-5" daemon prio=10 tid=0x7f8fd01a9800 > nid=0x566 in Object.wait() [0x7f90481d7000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x000630929340> (a > org.apache.spark.api.python.PythonRDD$WriterThread) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x000630929340> (a > org.apache.spark.api.python.PythonRDD$WriterThread) > at java.lang.Thread.join(Thread.java:1355) > at > org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78) > at > org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76) > at > org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68) > at > org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66) > at 
org.apache.spark.scheduler.Task.run(Task.scala:58) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
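The stack trace shows task completion blocked indefinitely in `Thread.join()` on the Python writer thread. A general remedy for this pattern, sketched here in plain Python and hedged (this is the generic technique, not necessarily what the linked PR does), is to join with a timeout and signal a still-alive thread to stop instead of waiting forever:

```python
import threading

done = threading.Event()

def writer():
    # Simulates a writer thread that never finishes on its own because
    # the consumer stopped reading early (as with take()/first()).
    done.wait()

t = threading.Thread(target=writer, daemon=True)
t.start()

t.join(timeout=0.1)          # bounded wait instead of joining forever
if t.is_alive():
    done.set()               # signal the writer to shut down
    t.join(timeout=1.0)      # now the join can complete
print(t.is_alive())  # False
```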
[jira] [Commented] (SPARK-5310) Update SQL programming guide for 1.3
[ https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358981#comment-14358981 ] Apache Spark commented on SPARK-5310: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5001 > Update SQL programming guide for 1.3 > > > Key: SPARK-5310 > URL: https://issues.apache.org/jira/browse/SPARK-5310 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > We make quite a few changes. We should update the SQL programming guide to > reflect these changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358964#comment-14358964 ] Apache Spark commented on SPARK-6286: - User 'dragos' has created a pull request for this issue: https://github.com/apache/spark/pull/5000 > Handle TASK_ERROR in TaskState > -- > > Key: SPARK-6286 > URL: https://issues.apache.org/jira/browse/SPARK-6286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Iulian Dragos >Priority: Minor > Labels: mesos > > Scala warning: > {code} > match may not be exhaustive. It would fail on the following input: TASK_ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358958#comment-14358958 ] Ilya Ganelin commented on SPARK-4927: - Hi Sean - I have a code snippet that reproduced this. Let me send it to you in a bit - I don't have the means to run 1.3 in a cluster. > Spark does not clean up properly during long jobs. > --- > > Key: SPARK-4927 > URL: https://issues.apache.org/jira/browse/SPARK-4927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Ilya Ganelin > > On a long running Spark job, Spark will eventually run out of memory on the > driver node due to metadata overhead from the shuffle operation. Spark will > continue to operate, however with drastically decreased performance (since > swapping now occurs with every operation). > The spark.cleanup.tll parameter allows a user to configure when cleanup > happens but the issue with doing this is that it isn’t done safely, e.g. If > this clears a cached RDD or active task in the middle of processing a stage, > this ultimately causes a KeyNotFoundException when the next stage attempts to > reference the cleared RDD or task. > There should be a sustainable mechanism for cleaning up stale metadata that > allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.
[ https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358933#comment-14358933 ] Sean Owen commented on SPARK-4927: -- I'm interested in this one. When I run it, though, it holds steady: 15/03/12 16:37:17 INFO MemoryStore: Block broadcast_29480_piece0 stored as bytes in memory (estimated size 1395.0 B, free 133.2 MB) It's always 133.1 or 133.2 MB for me. I wonder if you can still reproduce this on 1.3? > Spark does not clean up properly during long jobs. > --- > > Key: SPARK-4927 > URL: https://issues.apache.org/jira/browse/SPARK-4927 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Ilya Ganelin > > On a long-running Spark job, Spark will eventually run out of memory on the > driver node due to metadata overhead from the shuffle operation. Spark will > continue to operate, however with drastically decreased performance (since > swapping now occurs with every operation). > The spark.cleaner.ttl parameter allows a user to configure when cleanup > happens, but the issue with doing this is that it isn’t done safely, e.g. if > this clears a cached RDD or active task in the middle of processing a stage, > this ultimately causes a KeyNotFoundException when the next stage attempts to > reference the cleared RDD or task. > There should be a sustainable mechanism for cleaning up stale metadata that > allows the program to continue running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358905#comment-14358905 ] Iulian Dragos commented on SPARK-6286: -- Sure, I'll issue a PR for handling {{TASK_ERROR => TASK_LOST}} > Handle TASK_ERROR in TaskState > -- > > Key: SPARK-6286 > URL: https://issues.apache.org/jira/browse/SPARK-6286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Iulian Dragos >Priority: Minor > Labels: mesos > > Scala warning: > {code} > match may not be exhaustive. It would fail on the following input: TASK_ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1546) Add AdaBoost algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358881#comment-14358881 ] Manish Amde commented on SPARK-1546: I haven't worked on it since we haven't heard a need for it post the RF and GBT work. :-) This might be best done after the API standardization work on https://issues.apache.org/jira/browse/SPARK-6113 > Add AdaBoost algorithm to Spark MLlib > - > > Key: SPARK-1546 > URL: https://issues.apache.org/jira/browse/SPARK-1546 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Manish Amde >Assignee: Manish Amde > > This task requires adding the AdaBoost algorithm to Spark MLlib. The > implementation needs to adapt the classic AdaBoost algorithm to the scalable > tree implementation. > The task involves: > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6300) sc.addFile(path) does not support the relative path.
[ https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358877#comment-14358877 ] Sean Owen commented on SPARK-6300: -- (Sandy notes it's a regression, so yeah, it's more important. I didn't think this was ever supposed to work) > sc.addFile(path) does not support the relative path. > > > Key: SPARK-6300 > URL: https://issues.apache.org/jira/browse/SPARK-6300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.2.1 >Reporter: DoingDone9 >Assignee: DoingDone9 >Priority: Critical > > When I run a command like sc.addFile("../test.txt"), it does not work and throws > an exception: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: file:../test.txt > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.<init>(Path.java:172) > > ... > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > file:../test.txt > at java.net.URI.checkPath(URI.java:1804) > at java.net.URI.<init>(URI.java:752) > at org.apache.hadoop.fs.Path.initialize(Path.java:203) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
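Until the regression is fixed, a caller-side workaround is to resolve the relative path before handing it to Spark; a minimal sketch, assuming an existing SparkContext named sc as in the report:

```scala
// Workaround sketch for the URISyntaxException above: resolve the relative
// path against the current working directory so that Hadoop's Path sees an
// absolute path instead of "file:../test.txt".
import java.io.File

val path = new File("../test.txt").getCanonicalPath // absolute, normalized path
sc.addFile(path)
```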
[jira] [Updated] (SPARK-6299) ClassNotFoundException when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6299: - Component/s: Spark Shell > ClassNotFoundException when running groupByKey with class defined in REPL. > -- > > Key: SPARK-6299 > URL: https://issues.apache.org/jira/browse/SPARK-6299 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kevin (Sangwoo) Kim >Priority: Critical > > Anyone can reproduce this issue by the code below > (runs well in local mode, got exception with clusters) > (it runs well in Spark 1.1.1) > case class ClassA(value: String) > val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) )) > rdd.groupByKey.collect > org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 > in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage > 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): > java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:274) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) > at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) > at > org.apache.spark.scheduler.DAGSched
[jira] [Updated] (SPARK-6300) sc.addFile(path) does not support the relative path.
[ https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-6300: -- Priority: Critical (was: Minor) Target Version/s: 1.3.1 > sc.addFile(path) does not support the relative path. > > > Key: SPARK-6300 > URL: https://issues.apache.org/jira/browse/SPARK-6300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.2.1 >Reporter: DoingDone9 >Assignee: DoingDone9 >Priority: Critical > > When I run a command like sc.addFile("../test.txt"), it does not work and throws > an exception: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: file:../test.txt > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.<init>(Path.java:172) > > ... > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > file:../test.txt > at java.net.URI.checkPath(URI.java:1804) > at java.net.URI.<init>(URI.java:752) > at org.apache.hadoop.fs.Path.initialize(Path.java:203) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6273) Got error when one table's alias name is the same with other table's column name
[ https://issues.apache.org/jira/browse/SPARK-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6273: - Component/s: SQL Description: while one table's alias name is the same with other table's column name get the error Ambiguous references {code} Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to salary.pay_date: (pay_date#34749,List()),(salary#34792,List(pay_date)), tree: 'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && ('employee.store_id = 'store.store_id)) Join Inner, None Join Inner, None Join Inner, None MetastoreRelation yxqtest, time_by_day, Some(time_by_day) MetastoreRelation yxqtest, salary, Some(salary) MetastoreRelation yxqtest, store, Some(store) MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0) Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to salary.pay_date: (pay_date#34749,List()),(salary#34792,List(pay_date)), tree: 'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && ('employee.store_id = 'store.store_id)) Join Inner, None Join Inner, None Join Inner, None MetastoreRelation yxqtest, time_by_day, Some(time_by_day) MetastoreRelation yxqtest, salary, Some(salary) MetastoreRelation yxqtest, store, Some(store) MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0) {code} was: while one table's alias name is the same with other table's column name get the error Ambiguous references Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to salary.pay_date: (pay_date#34749,List()),(salary#34792,List(pay_date)), tree: 'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && ('employee.store_id = 
'store.store_id)) Join Inner, None Join Inner, None Join Inner, None MetastoreRelation yxqtest, time_by_day, Some(time_by_day) MetastoreRelation yxqtest, salary, Some(salary) MetastoreRelation yxqtest, store, Some(store) MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0) Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Ambiguous references to salary.pay_date: (pay_date#34749,List()),(salary#34792,List(pay_date)), tree: 'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && ('employee.store_id = 'store.store_id)) Join Inner, None Join Inner, None Join Inner, None MetastoreRelation yxqtest, time_by_day, Some(time_by_day) MetastoreRelation yxqtest, salary, Some(salary) MetastoreRelation yxqtest, store, Some(store) MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0) Summary: Got error when one table's alias name is the same with other table's column name (was: Got error when do join) (Make the title more descriptive and add a component) > Got error when one table's alias name is the same with other table's column > name > > > Key: SPARK-6273 > URL: https://issues.apache.org/jira/browse/SPARK-6273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.1 >Reporter: Jeff > > while one table's alias name is the same with other table's column name > get the error Ambiguous references > {code} > Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Ambiguous references to salary.pay_date: > (pay_date#34749,List()),(salary#34792,List(pay_date)), tree: > 'Filter 'salary.pay_date = 'time_by_day.the_date) && > ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = > 'employee.employee_id)) && ('employee.store_id = 'store.store_id)) > Join Inner, None > Join Inner, None >Join Inner, None > MetastoreRelation yxqtest, time_by_day, Some(time_by_day) > MetastoreRelation yxqtest, salary, Some(salary) 
>MetastoreRelation yxqtest, store, Some(store) > MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0) > Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Ambiguous references to salary.pay_date: > (pay_date#34749,List()),(salary#34792,List(pay_date)), tree: > 'Filter 'salary.pay_date = 'time_by_day.the_date) && > ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = > 'employee.employee_id)) && ('employee.store_id = 'store.store_id)) > Join Inner, None > Join Inner, None >Join Inner, None > MetastoreRelation yxqtest, time_by_day, Some(time_by_day) > MetastoreRelation yx
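The collision described above can be illustrated with a hypothetical pair of tables (the names below are invented; the report's schema is larger): a table aliased as salary, joined against a table that itself has a column named salary, so the reference salary.pay_date can resolve two ways.

```scala
// Hypothetical repro sketch, assuming a sqlContext with two registered
// tables: pay(employee_id, pay_date) and emp(employee_id, salary).
// `salary.pay_date` matches both the table alias `salary` and a nested-field
// lookup on emp's `salary` column, producing the "Ambiguous references"
// TreeNodeException quoted in the report.
sqlContext.sql("""
  SELECT salary.pay_date
  FROM pay salary
  JOIN emp ON salary.employee_id = emp.employee_id
""")
```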
[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358871#comment-14358871 ] Manish Amde commented on SPARK-1548: We should also leave this ticket unassigned for somebody else to pick up if/when interested. > Add Partial Random Forest algorithm to MLlib > > > Key: SPARK-1548 > URL: https://issues.apache.org/jira/browse/SPARK-1548 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde >Assignee: Frank Dai > > This task involves creating an alternate approximate random forest > implementation where each tree is constructed per partition. > The tasks involves: > - Justifying with theory and experimental results why this algorithm is a > good choice. > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1548) Add Partial Random Forest algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1548: - Assignee: (was: Frank Dai) > Add Partial Random Forest algorithm to MLlib > > > Key: SPARK-1548 > URL: https://issues.apache.org/jira/browse/SPARK-1548 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.0.0 >Reporter: Manish Amde > > This task involves creating an alternate approximate random forest > implementation where each tree is constructed per partition. > The tasks involves: > - Justifying with theory and experimental results why this algorithm is a > good choice. > - Comparing the various tradeoffs and finalizing the algorithm before > implementation > - Code implementation > - Unit tests > - Functional tests > - Performance tests > - Documentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6306) Readme points to dead link
[ https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358850#comment-14358850 ] Theodore Vasiloudis commented on SPARK-6306: I'll keep that in mind in the future. > Readme points to dead link > -- > > Key: SPARK-6306 > URL: https://issues.apache.org/jira/browse/SPARK-6306 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Theodore Vasiloudis >Priority: Trivial > Fix For: 1.4.0 > > > The link to "Specifying the Hadoop Version" now points to > http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version. > The correct link is: > http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6300) sc.addFile(path) does not support the relative path.
[ https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6300: - Priority: Minor (was: Critical) Target Version/s: (was: 1.3.1) > sc.addFile(path) does not support the relative path. > > > Key: SPARK-6300 > URL: https://issues.apache.org/jira/browse/SPARK-6300 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.2.1 >Reporter: DoingDone9 >Assignee: DoingDone9 >Priority: Minor > > When I run a command like sc.addFile("../test.txt"), it does not work and throws > an exception: > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: file:../test.txt > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.<init>(Path.java:172) > > ... > Caused by: java.net.URISyntaxException: Relative path in absolute URI: > file:../test.txt > at java.net.URI.checkPath(URI.java:1804) > at java.net.URI.<init>(URI.java:752) > at org.apache.hadoop.fs.Path.initialize(Path.java:203) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6299) ClassNotFoundException when running groupByKey with class defined in REPL.
[ https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358816#comment-14358816 ] Sean Owen commented on SPARK-6299: -- Hm, is this supposed to work? the class is not defined outside your driver process, and isn't found on the executors as a result. > ClassNotFoundException when running groupByKey with class defined in REPL. > -- > > Key: SPARK-6299 > URL: https://issues.apache.org/jira/browse/SPARK-6299 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.2.1 >Reporter: Kevin (Sangwoo) Kim >Priority: Critical > > Anyone can reproduce this issue by the code below > (runs well in local mode, got exception with clusters) > (it runs well in Spark 1.1.1) > case class ClassA(value: String) > val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) )) > rdd.groupByKey.collect > org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 > in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage > 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): > java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:274) > at > org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) > at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) > at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91) > at > org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44) > at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[jira] [Commented] (SPARK-6301) Unable to load external jars while submitting Spark Job
[ https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358815#comment-14358815 ] raju patel commented on SPARK-6301: --- I am trying to call Java functions, which basically means loading Java classes from the jar, from Python. To achieve this goal I am using jnius, which acts as a bridge between Python and Java. When I submit the Spark job with spark-submit --master local --jars /pathto/jar pythonfile.py it gives me the above-mentioned error, Class not found 'classname', for classes that are present inside that jar. Yes, I have carefully verified that all the classes are present inside the jar. Please let me know if you want to know any other details. > Unable to load external jars while submitting Spark Job > --- > > Key: SPARK-6301 > URL: https://issues.apache.org/jira/browse/SPARK-6301 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.2.0 >Reporter: raju patel > > We are using Jnius to call Java functions from Python. But when we are trying > to submit the job using Spark, it is not able to load the Java classes that > are provided in the --jars option, although it is successfully able to load > the Python class. > The error is like this: > c = find_javaclass(clsname) > File "jnius_export_func.pxi", line 23, in jnius.find_javaclass > (jnius/jnius.c:12815) > JavaException: Class not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6275) Miss toDF() function in docs/sql-programming-guide.md
[ https://issues.apache.org/jira/browse/SPARK-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6275: - Priority: Trivial (was: Minor) Assignee: zzc This is also too minor to bother with a JIRA. > Miss toDF() function in docs/sql-programming-guide.md > -- > > Key: SPARK-6275 > URL: https://issues.apache.org/jira/browse/SPARK-6275 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.3.0 >Reporter: zzc >Assignee: zzc >Priority: Trivial > Fix For: 1.4.0 > > > Miss toDF() function in docs/sql-programming-guide.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358799#comment-14358799 ] Sean Owen commented on SPARK-6286: -- [~dragos] I think it would be reasonable to handle this like {{TASK_LOST}}. I agree that there is not a reason to expect Mesos will be downgraded, and the required version is already required by Spark. This is also a little important to make sure this message is handled as intended and does not cause an exception. You want to make the simple PR? > Handle TASK_ERROR in TaskState > -- > > Key: SPARK-6286 > URL: https://issues.apache.org/jira/browse/SPARK-6286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Iulian Dragos >Priority: Minor > Labels: mesos > > Scala warning: > {code} > match may not be exhaustive. It would fail on the following input: TASK_ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
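The change discussed above - folding TASK_ERROR into the TASK_LOST arm so the match is exhaustive - can be sketched roughly as follows. This is a sketch only: the method name and the Spark-side state mapping are reconstructed for illustration, not copied from the actual patch.

```scala
// Sketch of an exhaustive mapping from Mesos task states to Spark's
// TaskState. Enum names follow org.apache.mesos.Protos; the Spark-side
// structure is assumed.
import org.apache.mesos.Protos.{TaskState => MesosTaskState}
import org.apache.spark.TaskState
import org.apache.spark.TaskState.TaskState

def fromMesosState(state: MesosTaskState): TaskState = state match {
  case MesosTaskState.TASK_STAGING | MesosTaskState.TASK_STARTING => TaskState.LAUNCHING
  case MesosTaskState.TASK_RUNNING  => TaskState.RUNNING
  case MesosTaskState.TASK_FINISHED => TaskState.FINISHED
  case MesosTaskState.TASK_FAILED   => TaskState.FAILED
  case MesosTaskState.TASK_KILLED   => TaskState.KILLED
  // Previously unhandled arm: a task Mesos could not even launch is
  // reported to Spark as lost, the same as TASK_LOST.
  case MesosTaskState.TASK_LOST | MesosTaskState.TASK_ERROR => TaskState.LOST
}
```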
[jira] [Updated] (SPARK-6301) Unable to load external jars while submitting Spark Job
[ https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6301: - Priority: Major (was: Blocker) Until it's clear what is being reported, this should not be marked "Blocker". Can you elaborate on what class is not found and what you are running? Can you load the Java class without this third-party library? Have you double-checked that the class is in your jar? It is not clear this is a Spark problem. > Unable to load external jars while submitting Spark Job > --- > > Key: SPARK-6301 > URL: https://issues.apache.org/jira/browse/SPARK-6301 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.2.0 >Reporter: raju patel > > We are using Jnius to call Java functions from Python. But when we are trying > to submit the job using Spark, it is not able to load the Java classes that > are provided in the --jars option, although it is successfully able to load > the Python class. > The error is like this: > c = find_javaclass(clsname) > File "jnius_export_func.pxi", line 23, in jnius.find_javaclass > (jnius/jnius.c:12815) > JavaException: Class not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6275) Miss toDF() function in docs/sql-programming-guide.md
[ https://issues.apache.org/jira/browse/SPARK-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6275. -- Resolution: Fixed Fix Version/s: 1.4.0 > Miss toDF() function in docs/sql-programming-guide.md > -- > > Key: SPARK-6275 > URL: https://issues.apache.org/jira/browse/SPARK-6275 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.3.0 >Reporter: zzc >Assignee: zzc >Priority: Trivial > Fix For: 1.4.0 > > > Miss toDF() function in docs/sql-programming-guide.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6306) Readme points to dead link
[ https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358784#comment-14358784 ] Sean Owen commented on SPARK-6306: -- For a trivial change, a JIRA is just overhead. You don't need one unless there is a meaningful difference between the problem description and the fix itself. > Readme points to dead link > -- > > Key: SPARK-6306 > URL: https://issues.apache.org/jira/browse/SPARK-6306 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Theodore Vasiloudis >Priority: Trivial > Fix For: 1.4.0 > > > The link to "Specifying the Hadoop Version" now points to > http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version. > The correct link is: > http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6306) Readme points to dead link
[ https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6306. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4999 [https://github.com/apache/spark/pull/4999] > Readme points to dead link > -- > > Key: SPARK-6306 > URL: https://issues.apache.org/jira/browse/SPARK-6306 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Theodore Vasiloudis >Priority: Trivial > Fix For: 1.4.0 > > > The link to "Specifying the Hadoop Version" now points to > http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version. > The correct link is: > http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6306) Readme points to dead link
[ https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358687#comment-14358687 ] Apache Spark commented on SPARK-6306: - User 'thvasilo' has created a pull request for this issue: https://github.com/apache/spark/pull/4999 > Readme points to dead link > -- > > Key: SPARK-6306 > URL: https://issues.apache.org/jira/browse/SPARK-6306 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Theodore Vasiloudis >Priority: Trivial > > The link to "Specifying the Hadoop Version" now points to > http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version. > The correct link is: > http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6306) Readme points to dead link
Theodore Vasiloudis created SPARK-6306: -- Summary: Readme points to dead link Key: SPARK-6306 URL: https://issues.apache.org/jira/browse/SPARK-6306 Project: Spark Issue Type: Bug Components: Documentation Reporter: Theodore Vasiloudis Priority: Trivial The link to "Specifying the Hadoop Version" now points to http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version. The correct link is: http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358625#comment-14358625 ] Apache Spark commented on SPARK-6305: - User 'liorchaga' has created a pull request for this issue: https://github.com/apache/spark/pull/4998 > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: New Feature > Components: Build >Reporter: Tal Sliwowicz > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6305) Add support for log4j 2.x to Spark
Tal Sliwowicz created SPARK-6305: Summary: Add support for log4j 2.x to Spark Key: SPARK-6305 URL: https://issues.apache.org/jira/browse/SPARK-6305 Project: Spark Issue Type: New Feature Components: Build Reporter: Tal Sliwowicz log4j 2 requires replacing the slf4j binding and adding the log4j jars in the classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358583#comment-14358583 ] Manoj Kumar commented on SPARK-5692: I'm not sure about Eclipse, but I work just in Sublime Text and build using the instructions given here: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools > Model import/export for Word2Vec > > > Key: SPARK-5692 > URL: https://issues.apache.org/jira/browse/SPARK-5692 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: ANUPAM MEDIRATTA > > Support save and load for Word2VecModel. We may want to discuss whether we > want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358573#comment-14358573 ] ANUPAM MEDIRATTA commented on SPARK-5692: - I tried working on it. I am new to Spark and Scala. I am not able to run tests in the Scala IDE, and I am not able to compile the code base in Eclipse so that I can run tests (to verify my code). Any instructions on how to compile this codebase in Eclipse (Scala IDE)? > Model import/export for Word2Vec > > > Key: SPARK-5692 > URL: https://issues.apache.org/jira/browse/SPARK-5692 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: ANUPAM MEDIRATTA > > Support save and load for Word2VecModel. We may want to discuss whether we > want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358529#comment-14358529 ] Meethu Mathew commented on SPARK-6227: -- [~mengxr] Please share your input on this. > PCA and SVD for PySpark > --- > > Key: SPARK-6227 > URL: https://issues.apache.org/jira/browse/SPARK-6227 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.2.1 >Reporter: Julien Amelot > > The Dimensionality Reduction techniques are not available via Python (Scala + > Java only). > * Principal component analysis (PCA) > * Singular value decomposition (SVD) > Doc: > http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
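The Scala-only APIs referenced in the ticket above (e.g. {{RowMatrix.computePrincipalComponents}} and {{RowMatrix.computeSVD}}) ultimately compute an eigendecomposition of the data's covariance matrix. As a rough, hedged sketch of what PCA produces — not the distributed MLlib implementation — here is a pure-Python power iteration on a tiny hardcoded dataset:

```python
# Illustrative only: the top principal component is the dominant
# eigenvector of the covariance matrix. Power iteration, no Spark
# or NumPy assumed; dataset and function names are made up.

def top_principal_component(rows, iters=200):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix C = X^T X / n on the mean-centered data.
    cov = [[sum(x[i] * x[j] for x in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        # Repeatedly apply C and renormalize; converges to the
        # dominant eigenvector for a positive-definite matrix.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points lie (noisily) along the line y = x, so the first principal
# component should be close to the direction (1/sqrt(2), 1/sqrt(2)).
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9]]
pc = top_principal_component(data)
print(pc)
```

Exposing this (via the existing Scala implementation, not a reimplementation) is essentially what the Python API gap described above amounts to.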
[jira] [Commented] (SPARK-6256) Python MLlib API missing items: Regression
[ https://issues.apache.org/jira/browse/SPARK-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358521#comment-14358521 ] Apache Spark commented on SPARK-6256: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/4997 > Python MLlib API missing items: Regression > -- > > Key: SPARK-6256 > URL: https://issues.apache.org/jira/browse/SPARK-6256 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > This JIRA lists items missing in the Python API for this sub-package of MLlib. > This list may be incomplete, so please check again when sending a PR to add > these features to the Python API. > Also, please check for major disparities between documentation; some parts of > the Python API are less well-documented than their Scala counterparts. Some > items may be listed in the umbrella JIRA linked to this task. > LassoWithSGD > * setIntercept > * setValidateData > LinearRegressionWithSGD, RidgeRegressionWithSGD > * setValidateData -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6301) Unable to load external jars while submitting Spark Job
[ https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] raju patel updated SPARK-6301: -- Description: We are using Jnius to call Java functions from Python. But when we are trying to submit the job using Spark, it is not able to load the Java classes that are provided in the --jars option, although it is successfully able to load the Python class. The error is like this: c = find_javaclass(clsname) File "jnius_export_func.pxi", line 23, in jnius.find_javaclass (jnius/jnius.c:12815) JavaException: Class not found was: We are using Jnius to call Java functions from Python. But when we are trying to submit the job using Spark, it is not able to load the Java classes that are provided in the --jars option although it is successfully able to load the Python class. The error is like this: c = find_javaclass(clsname) File "jnius_export_func.pxi", line 23, in jnius.find_javaclass (jnius/jnius.c:12815) JavaException: Class not found > Unable to load external jars while submitting Spark Job > --- > > Key: SPARK-6301 > URL: https://issues.apache.org/jira/browse/SPARK-6301 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.2.0 >Reporter: raju patel >Priority: Blocker > > We are using Jnius to call Java functions from Python. But when we are trying > to submit the job using Spark, it is not able to load the Java classes that > are provided in the --jars option, although it is successfully able to load > the Python class. > The error is like this: > c = find_javaclass(clsname) > File "jnius_export_func.pxi", line 23, in jnius.find_javaclass > (jnius/jnius.c:12815) > JavaException: Class not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6304) Checkpointing doesn't retain driver port
[ https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marius Soutier updated SPARK-6304: -- Description: In a check-pointed Streaming application running on a fixed driver port, the setting "spark.driver.port" is not loaded when recovering from a checkpoint. (The driver is then started on a random port.) was: In a check-pointed Streaming application running on a fixed driver port, the setting "spark.driver.port" is not loaded when recovering from checkpoint. (The driver is then started on a random port.) > Checkpointing doesn't retain driver port > > > Key: SPARK-6304 > URL: https://issues.apache.org/jira/browse/SPARK-6304 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.1 >Reporter: Marius Soutier > > In a check-pointed Streaming application running on a fixed driver port, the > setting "spark.driver.port" is not loaded when recovering from a checkpoint. > (The driver is then started on a random port.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6304) Checkpointing doesn't retain driver port
Marius Soutier created SPARK-6304: - Summary: Checkpointing doesn't retain driver port Key: SPARK-6304 URL: https://issues.apache.org/jira/browse/SPARK-6304 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.1 Reporter: Marius Soutier In a check-pointed Streaming application running on a fixed driver port, the setting "spark.driver.port" is not loaded when recovering from checkpoint. (The driver is then started on a random port.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function
[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358504#comment-14358504 ] Pavel Laskov commented on SPARK-6282: - Hi Sven and Joseph, Thanks for a quick reply to my bug report. I still think the problem is somewhere in Spark. Here is an autonomous code snippet which triggers the error on my system. Uncommenting any of the imports marked with ### causes a crash. Switching to "import random / random.random()" fixes the problem. None of the functions imported in the ### lines is used in the test code. Looks like a very obscure dependency of some mllib components on _winreg?
{code}
from random import random
# import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs
### Any of these imports causes the crash
### from pyspark.mllib.tree import RandomForest, DecisionTreeModel
### from pyspark.mllib.linalg import SparseVector
### from pyspark.mllib.regression import LabeledPoint

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = RandomRDDs.normalVectorRDD(sc, numRows=1, numCols=200)
    d = data.map(lambda x: (random(), x))
    print d.first()
{code}
Here is the full trace of the error:
{code}
Traceback (most recent call last):
  File "/home/laskov/research/pe-class/python/src/experiments/test_random.py", line 16, in <module>
    print d.first()
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
    rs = self.take(1)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
    totalParts = self._jrdd.partitions().size()
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 2115, in _jrdd
    pickled_command = ser.dumps(command)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/serializers.py", line 406, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 816, in dumps
    cp.dump(obj)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 133, in dump
    return pickle.Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 249, in save_function
    self.save_function_tuple(obj, modList)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 309, in save_function_tuple
    save(f_globals)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/pyth
{code}
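The trace above fails inside cloudpickle, which PySpark bundles precisely because the standard {{pickle}} module serializes functions by reference (module name + qualified name) and therefore cannot pickle an anonymous lambda like the one passed to {{map()}}. A small stdlib-only illustration of that underlying limitation (no Spark assumed):

```python
import pickle

# Standard pickle cannot serialize a lambda: it tries to look the
# function up by name in its module, and "<lambda>" is not a valid
# attribute. cloudpickle instead serializes the code object itself.
add_one = lambda x: x + 1

try:
    pickle.dumps(add_one)
    pickled_ok = True
except Exception as exc:
    pickled_ok = False
    print(type(exc).__name__)
```

When cloudpickle walks a closure it also pickles the globals the function references, which is why an unused-but-imported module can still surface in the traceback.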
[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357609#comment-14357609 ] Imran Rashid commented on SPARK-6190: - Hi [~rxin], I've been adding scattered notes across the various tickets, which I think has led to a lot of the confusion -- let me try to summarize things here. I completely agree about the importance of the various cases. Caching large blocks is *by far* the most important case. However, I think it's worth exploring the other cases now for two reasons. (a) I still think they need to be solved eventually for a consistent user experience. E.g., if caching locally works, but reading from a remote cache doesn't, a user will be baffled when on run 1 of their job, everything works fine, but on run 2, with the same data & same code, tasks get scheduled slightly differently and require a remote fetch, and KABOOM! That's the kind of experience that makes the average user want to throw Spark out the window. (This is actually what I thought you were pointing out in your comments on the earlier JIRA -- that we can forget about uploading at this point, but need to make sure remote fetches work.) (b) We should make sure that whatever approach we take at least leaves the door open for solutions to all the problems. At least for myself, I wasn't sure if this approach would work for everything initially, but exploring the options makes me feel like it's all possible. (Which gets to your question about large blocks vs. multi-blocks.) The proposal isn't exactly "read-only"; it also supports writing via {{LargeByteBufferOutputStream}}. It turns out that's all we need. The BlockManager currently exposes {{ByteBuffers}}, but it actually doesn't need to. For example, currently local shuffle fetches only expose a FileInputStream over the data -- that's why there isn't a 2GB limit on local shuffles. 
(It gets wrapped in a {{FileSegmentManagedBuffer}} and eventually read here: https://github.com/apache/spark/blob/55c4831d68c8326380086b5540244f984ea9ec27/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L300) It also makes sense that we only need stream access, since RDDs & broadcast vars are immutable -- e.g., we never say "treat bytes 37-40 as an int, and increment its value". Fundamentally, blocks are always created via serialization -- more specifically, {{Serializer#serializeStream}}. Obviously there isn't any limit when writing to a {{FileOutputStream}}; we just need a way to write to an in-memory output stream over 2GB. We can create an {{Array\[Array\[Byte\]\]}} already with {{ByteArrayChunkOutputStream}} https://github.com/apache/spark/blob/55c4831d68c8326380086b5540244f984ea9ec27/core/src/main/scala/org/apache/spark/util/io/ByteArrayChunkOutputStream.scala (currently used to create multiple blocks by TorrentBroadcast). We can use that to write out more than 2GB, e.g., creating many chunks of max size 64K. Similarly, we need a way to convert the various representations of large blocks back into {{InputStreams}}. File-based input streams have no problem ({{DiskStore}} only fails because the code currently tries to convert to a {{ByteBuffer}}, though conceptually this is unnecessary). For in-memory large blocks, represented as {{Array\[Array\[Byte\]\]}}, again we can do the same as {{TorrentBroadcast}}. The final case is network transfer. This involves changing the netty frame decoder to handle frames that are > 2GB -- then we just use the same input stream for the in-memory case. That was the last piece that I was prototyping, and was mentioning in my latest comments. 
I have an implementation available here: https://github.com/squito/spark/blob/5e83a55daa30a19840214f77681248e112635bf6/network/common/src/main/java/org/apache/spark/network/protocol/FixedChunkLargeFrameDecoder.java It's a good question whether we should allow large blocks, or instead have blocks be limited at 2GB and have another layer put multiple blocks together. I don't know if I have very clear objective arguments for one vs. the other, but I did consider both and felt like this version was much simpler to implement. Especially given the limited API that is actually needed (only stream access), the changes proposed here really aren't that big. It keeps the changes more nicely contained to the layers underneath BlockManager (with mostly cosmetic / naming changes required in outer layers since we'd no longer be returning ByteBuffers). Going down this road certainly doesn't prevent us from later deciding to have blocks be fragmented (then it's just a question of naming: are "blocks" the smallest units that we work with in the internals, and there is some new logical unit which wraps blocks? Or are "blocks" the logical unit that is exposed, and there is some new smaller unit which is used by the internals?)
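The approach described above — representing a large block as chunks of bounded size ({{Array\[Array\[Byte\]\]}}, as {{ByteArrayChunkOutputStream}} does) and exposing it only through stream access — can be sketched in a few lines. This is a hedged pure-Python illustration with a tiny chunk size; the class and method names are hypothetical, not Spark's API:

```python
import io

class ChunkedOutputStream:
    """Accumulates written bytes as a list of fixed-size chunks, so no
    single contiguous buffer (and hence no 2GB-style limit on any one
    array) is ever required."""
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        data = memoryview(data)
        while len(data) > 0:
            last = self.chunks[-1]
            room = self.chunk_size - len(last)
            if room == 0:
                # Current chunk is full; start a new one.
                self.chunks.append(bytearray())
                continue
            last.extend(data[:room])
            data = data[room:]

    def to_input_stream(self):
        # Expose the chunks only as a readable stream, mirroring the
        # point above that callers never need one giant ByteBuffer.
        return io.BytesIO(b"".join(bytes(c) for c in self.chunks))

out = ChunkedOutputStream(chunk_size=4)
out.write(b"hello large block")
print([len(c) for c in out.chunks])   # no chunk exceeds chunk_size
print(out.to_input_stream().read())
```

The real implementation would additionally avoid the final join (reading chunk by chunk), but the shape of the abstraction is the same: writes split across bounded chunks, reads stitched back together as a stream.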
[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods
[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357610#comment-14357610 ] mgdadv commented on SPARK-6189: --- While the dot is legal in R and SQL, I don't think there is a nice way of making it legal in Python. So at least in the Spark Python code, I think something should be done about it. I just realized that the automatic renaming can cause problems if that entry already exists. For example, what if GNP_deflator was already in the data set and then GNP.deflator gets changed. I think the best thing to do is to just warn the user by printing out a warning message. I have changed the patch accordingly. Here is some example code for pyspark:
{code}
import pandas as pd
df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
spdf = sqlCtx.createDataFrame(df)
spdf.take(2)
spdf[spdf.a==102].take(2)
{code}
So far this works, but this fails:
{code}
spdf[spdf.a.b==101].take(2)
{code}
In pandas df.a.b doesn't work either, but the fields can be accessed via the string "a.b", i.e.:
{code}
df["a.b"]
{code}
> Pandas to DataFrame conversion should check field names for periods > --- > > Key: SPARK-6189 > URL: https://issues.apache.org/jira/browse/SPARK-6189 > Project: Spark > Issue Type: Improvement > Components: DataFrame, SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Issue I ran into: I imported an R dataset in CSV format into a Pandas > DataFrame and then use toDF() to convert that into a Spark DataFrame. The R > dataset had a column with a period in it (column "GNP.deflator" in the > "longley" dataset). When I tried to select it using the Spark DataFrame DSL, > I could not because the DSL thought the period was selecting a field within > GNP. 
> Also, since "GNP" is another field's name, it gives an error which could be > obscure to users, complaining: > {code} > org.apache.spark.sql.AnalysisException: GetField is not valid on fields of > type DoubleType; > {code} > We should either handle periods in column names or check during loading and > warn/fail gracefully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
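The collision hazard described in the comment above (renaming "GNP.deflator" to "GNP_deflator" when a "GNP_deflator" column already exists) is straightforward to guard against. A stdlib-only sketch of the rename-and-warn logic — the helper name is hypothetical, not the actual PySpark patch:

```python
import warnings

def sanitize_columns(names):
    """Replace periods in column names with underscores, warning when
    the sanitized name collides with an existing column. Hypothetical
    helper illustrating the fix discussed above."""
    taken = set(names)
    result = []
    for name in names:
        if "." not in name:
            result.append(name)
            continue
        candidate = name.replace(".", "_")
        if candidate in taken:
            warnings.warn("renaming %r to %r collides with an existing "
                          "column" % (name, candidate))
        taken.add(candidate)
        result.append(candidate)
    return result

print(sanitize_columns(["GNP.deflator", "GNP", "Employed"]))
# The collision case: "GNP_deflator" already exists alongside
# "GNP.deflator", so the rename triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sanitize_columns(["GNP.deflator", "GNP_deflator"])
print(len(caught))
```

Warning rather than failing matches the behavior proposed in the comment; the alternative is to reject such frames at load time with a clear error.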
[jira] [Commented] (SPARK-6303) Average should be in canBeCodeGened list
[ https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358399#comment-14358399 ] Apache Spark commented on SPARK-6303: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4996 > Average should be in canBeCodeGened list > > > Key: SPARK-6303 > URL: https://issues.apache.org/jira/browse/SPARK-6303 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh > > Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, > CollectHashSet. Average should be in the list too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6303) Average should be in canBeCodeGened list
Liang-Chi Hsieh created SPARK-6303: -- Summary: Average should be in canBeCodeGened list Key: SPARK-6303 URL: https://issues.apache.org/jira/browse/SPARK-6303 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, CollectHashSet. Average should be in the list too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
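Average is a good candidate for code generation because it decomposes into the already-supported Sum and Count: each partition keeps a partial (sum, count) pair, the pairs merge associatively, and the final value is sum/count. A small Python sketch of that decomposition (illustrative only, not the Catalyst codegen itself):

```python
def partial_avg(values):
    # Per-partition partial state: (sum, count).
    return (float(sum(values)), len(values))

def merge(a, b):
    # Partial states combine associatively, which is what makes the
    # aggregate safe to evaluate partition-by-partition.
    return (a[0] + b[0], a[1] + b[1])

def finish(state):
    s, n = state
    return s / n if n else None

partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
state = (0.0, 0)
for p in partitions:
    state = merge(state, partial_avg(p))
print(finish(state))  # average of 1..6
```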
[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6227: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-6100 > PCA and SVD for PySpark > --- > > Key: SPARK-6227 > URL: https://issues.apache.org/jira/browse/SPARK-6227 > Project: Spark > Issue Type: Sub-task > Components: MLlib, PySpark >Affects Versions: 1.2.1 >Reporter: Julien Amelot >Priority: Minor > > The Dimensionality Reduction techniques are not available via Python (Scala + > Java only). > * Principal component analysis (PCA) > * Singular value decomposition (SVD) > Doc: > http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState
[ https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357310#comment-14357310 ] Iulian Dragos commented on SPARK-6286: -- Good point. It's been [introduced in 0.21.0|http://mesos.apache.org/blog/mesos-0-21-0-released/]. According to [pom.xml|https://github.com/apache/spark/blob/master/pom.xml#L119], Spark depends on `0.21.0`, so it seems safe to handle it. Feel free to close if you think it's going to break something else. > Handle TASK_ERROR in TaskState > -- > > Key: SPARK-6286 > URL: https://issues.apache.org/jira/browse/SPARK-6286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Iulian Dragos >Priority: Minor > Labels: mesos > > Scala warning: > {code} > match may not be exhaustive. It would fail on the following input: TASK_ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6198) Support "select current_database()"
[ https://issues.apache.org/jira/browse/SPARK-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358361#comment-14358361 ] Apache Spark commented on SPARK-6198: - User 'DoingDone9' has created a pull request for this issue: https://github.com/apache/spark/pull/4995 > Support "select current_database()" > --- > > Key: SPARK-6198 > URL: https://issues.apache.org/jira/browse/SPARK-6198 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.1 >Reporter: DoingDone9 > > The evaluate method has changed in UDFCurrentDB; it now just throws an > exception, but hiveUdfs calls this method and fails: > {code} > @Override > public Object evaluate(DeferredObject[] arguments) throws HiveException { > throw new IllegalStateException("never"); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6285) Duplicated code leads to errors
[ https://issues.apache.org/jira/browse/SPARK-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357255#comment-14357255 ] Sean Owen commented on SPARK-6285: -- I do not observe any compilation problem in Maven or IntelliJ though, so I don't know if it's an actual problem in the source. That said, I don't see why there are two copies of the same class; one can be removed. But the containing class in the main source tree looks like test code. I think you can try moving it to the test tree too as part of a fix. ParquetTest is only used from test code, and ParquetTestData is... only used in sql's README.md? maybe my IDE is reading that wrong. > Duplicated code leads to errors > --- > > Key: SPARK-6285 > URL: https://issues.apache.org/jira/browse/SPARK-6285 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0 >Reporter: Iulian Dragos > > The following class is duplicated inside > [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39] > and > [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44], > with exact same code and fully qualified name: > {code} > org.apache.spark.sql.parquet.TestGroupWriteSupport > {code} > The second one was introduced in > [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830], > but even though it mentions that `ParquetTestData` should be removed later, > I couldn't find a corresponding Jira ticket. > This duplicate class causes the Eclipse builder to fail (since src/main and > src/test are compiled together in Eclipse, unlike Sbt). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework
[ https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357180#comment-14357180 ] Timothy Chen commented on SPARK-6284: - https://github.com/apache/spark/pull/4960 > Support framework authentication and role in Mesos framework > > > Key: SPARK-6284 > URL: https://issues.apache.org/jira/browse/SPARK-6284 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > > Support framework authentication and role in both Coarse grain and fine grain > mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5987) Model import/export for GaussianMixtureModel
[ https://issues.apache.org/jira/browse/SPARK-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357109#comment-14357109 ] Joseph K. Bradley commented on SPARK-5987: -- This isn't a bug in Spark SQL. The issue is that we haven't defined a UserDefinedType for Matrices. (We should, but haven't yet.) When I said "basic types," I meant the types enumerated on the SQL programming guide (basically, Array[Double] or Seq[Double] will be best). I'd recommend flattening the matrix into an Array[Double] instead of having nested types. The nesting is less efficient because of all of the extra objects it creates. > Model import/export for GaussianMixtureModel > > > Key: SPARK-5987 > URL: https://issues.apache.org/jira/browse/SPARK-5987 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar > > Support save/load for GaussianMixtureModel -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
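The flattening Joseph recommends can be sketched as follows. This is a hypothetical illustration in plain Python, not the MLlib or Spark SQL API: a row-major matrix becomes a flat list of doubles (the SQL-friendly "basic type"), with the dimensions stored alongside it so the nested shape can be rebuilt on load. The function names are made up for this sketch.

```python
# Hypothetical sketch (not the MLlib API): store a matrix as a flat,
# row-major list of doubles plus its dimensions, instead of a nested
# sequence-of-sequences, which creates one extra object per row.

def flatten_matrix(matrix):
    """Flatten a row-major matrix into (num_rows, num_cols, values)."""
    num_rows = len(matrix)
    num_cols = len(matrix[0]) if num_rows > 0 else 0
    values = [v for row in matrix for v in row]
    return num_rows, num_cols, values

def unflatten_matrix(num_rows, num_cols, values):
    """Rebuild the nested matrix from the flat representation."""
    return [values[r * num_cols:(r + 1) * num_cols] for r in range(num_rows)]
```

On save, only `(num_rows, num_cols, values)` needs to be written; on load, `unflatten_matrix` restores the original shape, so no UserDefinedType for Matrices is required.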
[jira] [Commented] (SPARK-6302) GeneratedAggregate uses wrong schema on updateProjection
[ https://issues.apache.org/jira/browse/SPARK-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358333#comment-14358333 ] Apache Spark commented on SPARK-6302: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4994 > GeneratedAggregate uses wrong schema on updateProjection > > > Key: SPARK-6302 > URL: https://issues.apache.org/jira/browse/SPARK-6302 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor > > The updateProjection in GeneratedAggregate now uses the updateSchema as its > input schema. In fact, the schema should be child.output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6301) Unable to load external jars while submitting Spark Job
[ https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] raju patel updated SPARK-6301: -- Description: We are using Jnius to call Java functions from Python. But when we are trying to submit the job using Spark, it is not able to load the Java classes that are provided in the --jars option although it is successfully able to load the Python class. The error is like this: c = find_javaclass(clsname) File "jnius_export_func.pxi", line 23, in jnius.find_javaclass (jnius/jnius.c:12815) JavaException: Class not found was: We are using Jnius to call Java functions from Python. But when we are trying to submit the job using Spark, it is not able to load the Java classes that are provided in the --jars option although it is successfully able to load the Python class. The error is like this: c = find_javaclass(clsname) File "jnius_export_func.pxi", line 23, in jnius.find_javaclass (jnius/jnius.c:12815) > Unable to load external jars while submitting Spark Job > --- > > Key: SPARK-6301 > URL: https://issues.apache.org/jira/browse/SPARK-6301 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Submit >Affects Versions: 1.2.0 >Reporter: raju patel >Priority: Blocker > > We are using Jnius to call Java functions from Python. But when we are trying > to submit the job using Spark, it is not able to load the Java classes that > are provided in the --jars option although it is successfully able to load > the Python class. > The error is like this: > c = find_javaclass(clsname) > File "jnius_export_func.pxi", line 23, in jnius.find_javaclass > (jnius/jnius.c:12815) > JavaException: Class not found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
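A possible explanation, sketched below as an assumption rather than a confirmed diagnosis: Pyjnius starts its own JVM and builds that JVM's classpath from the CLASSPATH environment variable, while `--jars` only affects the classpath of Spark's own JVM processes. If that holds, extending CLASSPATH in the driver before the first `import jnius` may be a workaround. The jar path and helper name below are placeholders for illustration.

```python
# Hypothetical workaround sketch: assuming Pyjnius reads the CLASSPATH
# environment variable when it launches its JVM, jars passed to
# spark-submit via --jars would not be visible to it. Extending
# CLASSPATH before `import jnius` may make the Java classes loadable.
import os

def add_to_classpath(jar_path):
    """Prepend a jar to the CLASSPATH environment variable."""
    existing = os.environ.get("CLASSPATH", "")
    parts = [jar_path] + ([existing] if existing else [])
    os.environ["CLASSPATH"] = os.pathsep.join(parts)
    return os.environ["CLASSPATH"]

# add_to_classpath("/path/to/your-functions.jar")  # placeholder path
# import jnius  # import only after CLASSPATH is set
```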
[jira] [Created] (SPARK-6302) GeneratedAggregate uses wrong schema on updateProjection
Liang-Chi Hsieh created SPARK-6302: -- Summary: GeneratedAggregate uses wrong schema on updateProjection Key: SPARK-6302 URL: https://issues.apache.org/jira/browse/SPARK-6302 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor The updateProjection in GeneratedAggregate now uses the updateSchema as its input schema. In fact, the schema should be child.output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357069#comment-14357069 ] Manoj Kumar commented on SPARK-5692: okay, great > Model import/export for Word2Vec > > > Key: SPARK-5692 > URL: https://issues.apache.org/jira/browse/SPARK-5692 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: ANUPAM MEDIRATTA > > Support save and load for Word2VecModel. We may want to discuss whether we > want to be compatible with the original Word2Vec model storage format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org