[jira] [Reopened] (SPARK-6145) ORDER BY fails to resolve nested fields

2015-03-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-6145:
-
  Assignee: Michael Armbrust

> ORDER BY fails to resolve nested fields
> ---
>
> Key: SPARK-6145
> URL: https://issues.apache.org/jira/browse/SPARK-6145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.3.0
>
>
> {code}
> sqlContext.jsonRDD(sc.parallelize(
>   """{"a": {"b": 1}, "c": 1}""" :: Nil)).registerTempTable("nestedOrder")
> // Works
> sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY c")
> // Fails now
> sqlContext.sql("SELECT 1 FROM nestedOrder ORDER BY a.b")
> // Fails now
> sqlContext.sql("SELECT a.b FROM nestedOrder ORDER BY a.b")
> {code}
> Relatedly, the error message for a bad field access should also include the
> name of the field in question.
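For illustration, a hedged sketch (hypothetical helper and wording, not the
actual analyzer output) of an error message that names the missing field:

{code}
// Hypothetical helper: report both the missing field and the fields that exist.
def missingFieldError(field: String, available: Seq[String]): String =
  s"No such struct field '$field' among (${available.mkString(", ")})"

println(missingFieldError("d", Seq("b", "c")))
{code}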






[jira] [Created] (SPARK-6315) SparkSQL 1.3.0 (RC3) fails to read parquet file generated by 1.1.1

2015-03-12 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-6315:
---

 Summary: SparkSQL 1.3.0 (RC3) fails to read parquet file generated 
by 1.1.1
 Key: SPARK-6315
 URL: https://issues.apache.org/jira/browse/SPARK-6315
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker


Parquet files generated by Spark 1.1 have a deprecated representation of the 
schema.  In Spark 1.3 we fail to read these files through the new Parquet code 
path.  We should continue to read these files until we formally deprecate this 
representation.

As a workaround:
{code}
SET spark.sql.parquet.useDataSourceApi=false
{code}
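A hedged illustration of applying the same workaround programmatically (assumes
an existing SQLContext named sqlContext and a hypothetical path written by
Spark 1.1):

{code}
// Disable the new data source code path, then read the old file.
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
val oldData = sqlContext.parquetFile("/path/to/parquet-written-by-1.1")
{code}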






[jira] [Commented] (SPARK-6279) Miss expressions flag "s" at logging string

2015-03-12 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360013#comment-14360013
 ] 

zzc commented on SPARK-6279:


[~srowen], I am new to Spark and JIRA; sorry about that.

> Miss expressions flag "s" at logging string 
> 
>
> Key: SPARK-6279
> URL: https://issues.apache.org/jira/browse/SPARK-6279
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: zzc
>Assignee: zzc
>Priority: Trivial
> Fix For: 1.4.0
>
>
> In KafkaRDD.scala, the string interpolation flag "s" is missing from a logging
> string. As a result, the log file prints the literal text `Beginning offset
> ${part.fromOffset} is the same as ending offset` instead of the interpolated
> value, e.g. `Beginning offset 111 is the same as ending offset`.
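A minimal, self-contained sketch of the effect (using println in place of the
actual logger):

{code}
val fromOffset = 111L
// Without the "s" prefix the placeholder is printed literally:
println("Beginning offset ${fromOffset} is the same as ending offset")
// With the "s" prefix the value is interpolated:
println(s"Beginning offset ${fromOffset} is the same as ending offset")
{code}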






[jira] [Commented] (SPARK-6275) Miss toDF() function in docs/sql-programming-guide.md

2015-03-12 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14360012#comment-14360012
 ] 

zzc commented on SPARK-6275:


[~srowen], I am new to Spark and JIRA; sorry about that.

> Miss toDF() function in docs/sql-programming-guide.md 
> --
>
> Key: SPARK-6275
> URL: https://issues.apache.org/jira/browse/SPARK-6275
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: zzc
>Assignee: zzc
>Priority: Trivial
> Fix For: 1.4.0
>
>
> The toDF() function call is missing in docs/sql-programming-guide.md.






[jira] [Created] (SPARK-6314) Failed to load application log data from FileStatus

2015-03-12 Thread zzc (JIRA)
zzc created SPARK-6314:
--

 Summary: Failed to load application log data from FileStatus
 Key: SPARK-6314
 URL: https://issues.apache.org/jira/browse/SPARK-6314
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: zzc


The history server reports errors such as the following when reading the 
event-log directory while a job is still running:
{quote}
com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName2(ReaderBasedJsonParser.java:1284)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName(ReaderBasedJsonParser.java:1268)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:618)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:43)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42)
at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:49)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:260)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$6.apply(FsHistoryProvider.scala:190)
at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$6.apply(FsHistoryProvider.scala:188)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:188)
at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1$$anonfun$run$1.apply$mcV$sp(FsHistoryProvider.scala:94)
at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1$$anonfun$run$1.apply(FsHistoryProvider.scala:85)
at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1$$anonfun$run$1.apply(FsHistoryProvider.scala:85)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1.run(FsHistoryProvider.scala:84)
{quote}
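A minimal illustration of the failure mode (hypothetical truncated event line;
json4s is what ReplayListenerBus uses): parsing a line that was cut off in the
middle of a field name fails with the same JsonParseException.

{code}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// A line from an event log that is still being written, cut off inside a name:
val truncatedLine = """{"Event":"SparkListenerTaskEnd","Task Inf"""
parse(truncatedLine) // throws com.fasterxml.jackson.core.JsonParseException
{code}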






[jira] [Updated] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6313:
---
Priority: Critical  (was: Major)

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Nathan McCarthy
>Priority: Critical
>
> When running in cluster mode and mounting the spark work dir on a NFS volume 
> (or some volume which doesn't support file locking), the fetchFile (used for 
> downloading JARs etc on the executors) method in Spark Utils class will fail. 
> This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
>  
> Introduced in 1.2 in commit; 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is for optimisation for fetching files, could we take a 
> different approach here to create a temp/advisory lock file? 
> Typically you would just mount local disks (in say ext4 format) and provide 
> this as a comma separated list however we are trying to run Spark on MapR. 
> With MapR we can do a loop back mount to a volume on the local node and take 
> advantage of MapRs disk pools. This also means we dont need specific mounts 
> for Spark and improves the generic nature of the cluster. 






[jira] [Comment Edited] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Nathan McCarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359972#comment-14359972
 ] 

Nathan McCarthy edited comment on SPARK-6313 at 3/13/15 5:38 AM:
-

Since the {code}val lockFileName = s"${url.hashCode}${timestamp}_lock"{code} 
includes a timestamp, I can't see there being too many problems with hanging or 
left-over lock files. 


was (Author: nemccarthy):
Since the `val lockFileName = s"${url.hashCode}${timestamp}_lock"` uses a 
timestamp I can't see there being too many problems with hanging/left over lock 
files. 

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Nathan McCarthy
>
> When running in cluster mode and mounting the spark work dir on a NFS volume 
> (or some volume which doesn't support file locking), the fetchFile (used for 
> downloading JARs etc on the executors) method in Spark Utils class will fail. 
> This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
>  
> Introduced in 1.2 in commit; 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is for optimisation for fetching files, could we take a 
> different approach here to create a temp/advisory lock file? 
> Typically you would just mount local disks (in say ext4 format) and provide 
> this as a comma separated list however we are trying to run Spark on MapR. 
> With MapR we can do a loop back mount to a volume on the local node and take 
> advantage of MapRs disk pools. This also means we dont need specific mounts 
> for Spark and improves the generic nature of the cluster. 






[jira] [Updated] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Nathan McCarthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan McCarthy updated SPARK-6313:
---
Affects Version/s: 1.2.0
   1.2.1

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0, 1.2.1
>Reporter: Nathan McCarthy
>
> When running in cluster mode and mounting the spark work dir on a NFS volume 
> (or some volume which doesn't support file locking), the fetchFile (used for 
> downloading JARs etc on the executors) method in Spark Utils class will fail. 
> This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
>  
> Introduced in 1.2 in commit; 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is for optimisation for fetching files, could we take a 
> different approach here to create a temp/advisory lock file? 
> Typically you would just mount local disks (in say ext4 format) and provide 
> this as a comma separated list however we are trying to run Spark on MapR. 
> With MapR we can do a loop back mount to a volume on the local node and take 
> advantage of MapRs disk pools. This also means we dont need specific mounts 
> for Spark and improves the generic nature of the cluster. 






[jira] [Commented] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Nathan McCarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359972#comment-14359972
 ] 

Nathan McCarthy commented on SPARK-6313:


Since the `val lockFileName = s"${url.hashCode}${timestamp}_lock"` uses a 
timestamp I can't see there being too many problems with hanging/left over lock 
files. 

> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Nathan McCarthy
>
> When running in cluster mode and mounting the spark work dir on a NFS volume 
> (or some volume which doesn't support file locking), the fetchFile (used for 
> downloading JARs etc on the executors) method in Spark Utils class will fail. 
> This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
>  
> Introduced in 1.2 in commit; 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is for optimisation for fetching files, could we take a 
> different approach here to create a temp/advisory lock file? 
> Typically you would just mount local disks (in say ext4 format) and provide 
> this as a comma separated list however we are trying to run Spark on MapR. 
> With MapR we can do a loop back mount to a volume on the local node and take 
> advantage of MapRs disk pools. This also means we dont need specific mounts 
> for Spark and improves the generic nature of the cluster. 






[jira] [Commented] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Nathan McCarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359964#comment-14359964
 ] 

Nathan McCarthy commented on SPARK-6313:


Suggestion along the lines of:

https://github.com/apache/lucene-solr/blob/5314a56924f46522993baf106e6deca0e48a967f/lucene/core/src/java/org/apache/lucene/store/SimpleFSLockFactory.java
 
or
https://github.com/graphhopper/graphhopper/blob/master/core/src/main/java/com/graphhopper/storage/SimpleFSLockFactory.java
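
A hedged sketch (hypothetical helper, not the Lucene or GraphHopper code) of 
the create-file style of advisory locking those classes implement: atomically 
creating the lock file acquires the lock, so no FileChannel.lock() call is 
needed and the scheme works on mounts without POSIX lock support.

{code}
import java.io.File

def withAdvisoryLock[T](lockFile: File)(body: => T): Option[T] = {
  if (lockFile.createNewFile()) { // atomic create: succeeds for exactly one caller
    try Some(body)
    finally lockFile.delete()     // release the lock
  } else {
    None                          // lock already held by someone else
  }
}

// Example with a hypothetical lock path:
// withAdvisoryLock(new File("/tmp/spark_fetch_12345_lock")) { /* fetch the file */ }
{code}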


> Fetch File Lock file creation doesnt work when Spark working dir is on a NFS 
> mount
> --
>
> Key: SPARK-6313
> URL: https://issues.apache.org/jira/browse/SPARK-6313
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Nathan McCarthy
>
> When running in cluster mode and mounting the spark work dir on a NFS volume 
> (or some volume which doesn't support file locking), the fetchFile (used for 
> downloading JARs etc on the executors) method in Spark Utils class will fail. 
> This file locking was introduced as an improvement with SPARK-2713. 
> See 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415
>  
> Introduced in 1.2 in commit; 
> https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 
> As this locking is for optimisation for fetching files, could we take a 
> different approach here to create a temp/advisory lock file? 
> Typically you would just mount local disks (in say ext4 format) and provide 
> this as a comma separated list however we are trying to run Spark on MapR. 
> With MapR we can do a loop back mount to a volume on the local node and take 
> advantage of MapRs disk pools. This also means we dont need specific mounts 
> for Spark and improves the generic nature of the cluster. 






[jira] [Comment Edited] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed

2015-03-12 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359950#comment-14359950
 ] 

Tathagata Das edited comment on SPARK-6222 at 3/13/15 5:09 AM:
---

I proposed another way to fix this here:
https://github.com/apache/spark/pull/5008
Basically, don't clear checkpoint data after the pre-batch-start checkpoint. 

BTW, super thanks to [~hshreedharan] for painstakingly explaining to me offline 
what the problem was.


was (Author: tdas):
I proposed another way to fix this here
https://github.com/apache/spark/pull/5008
Basically, dont clear checkpoint data after the pre-batch-start checkpoint. 

BTW, super thanks to [~hshreedharan] for painstakingly explaining me offline 
what the problem was. I

> [STREAMING] All data may not be recovered from WAL when driver is killed
> 
>
> Key: SPARK-6222
> URL: https://issues.apache.org/jira/browse/SPARK-6222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Hari Shreedharan
>Priority: Blocker
> Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch
>
>
> When testing for our next release, our internal tests written by [~wypoon] 
> caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs 
> FlumePolling stream to read data from Flume, then kills the Application 
> Master. Once YARN restarts it, the test waits until no more data is to be 
> written and verifies the original against the data on HDFS. This was passing 
> in 1.2.0, but is failing now.
> Since the test ties into Cloudera's internal infrastructure and build 
> process, it cannot be directly run on an Apache build. But I have been 
> working on isolating the commit that may have caused the regression. I have 
> confirmed that it was caused by SPARK-5147 (PR # 
> [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several 
> times using the test and the failure is consistently reproducible. 
> To re-confirm, I reverted just this one commit (and Clock consolidation one 
> to avoid conflicts), and the issue was no longer reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
> /cc [~tdas], [~pwendell]






[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed

2015-03-12 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359950#comment-14359950
 ] 

Tathagata Das commented on SPARK-6222:
--

I proposed another way to fix this here
https://github.com/apache/spark/pull/5008
Basically, dont clear checkpoint data after the pre-batch-start checkpoint. 

BTW, super thanks to [~hshreedharan] for painstakingly explaining me offline 
what the problem was. I

> [STREAMING] All data may not be recovered from WAL when driver is killed
> 
>
> Key: SPARK-6222
> URL: https://issues.apache.org/jira/browse/SPARK-6222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Hari Shreedharan
>Priority: Blocker
> Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch
>
>
> When testing for our next release, our internal tests written by [~wypoon] 
> caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs 
> FlumePolling stream to read data from Flume, then kills the Application 
> Master. Once YARN restarts it, the test waits until no more data is to be 
> written and verifies the original against the data on HDFS. This was passing 
> in 1.2.0, but is failing now.
> Since the test ties into Cloudera's internal infrastructure and build 
> process, it cannot be directly run on an Apache build. But I have been 
> working on isolating the commit that may have caused the regression. I have 
> confirmed that it was caused by SPARK-5147 (PR # 
> [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several 
> times using the test and the failure is consistently reproducible. 
> To re-confirm, I reverted just this one commit (and Clock consolidation one 
> to avoid conflicts), and the issue was no longer reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
> /cc [~tdas], [~pwendell]






[jira] [Commented] (SPARK-6222) [STREAMING] All data may not be recovered from WAL when driver is killed

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359948#comment-14359948
 ] 

Apache Spark commented on SPARK-6222:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/5008

> [STREAMING] All data may not be recovered from WAL when driver is killed
> 
>
> Key: SPARK-6222
> URL: https://issues.apache.org/jira/browse/SPARK-6222
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Hari Shreedharan
>Priority: Blocker
> Attachments: AfterPatch.txt, CleanWithoutPatch.txt, SPARK-6122.patch
>
>
> When testing for our next release, our internal tests written by [~wypoon] 
> caught a regression in Spark Streaming between 1.2.0 and 1.3.0. The test runs 
> FlumePolling stream to read data from Flume, then kills the Application 
> Master. Once YARN restarts it, the test waits until no more data is to be 
> written and verifies the original against the data on HDFS. This was passing 
> in 1.2.0, but is failing now.
> Since the test ties into Cloudera's internal infrastructure and build 
> process, it cannot be directly run on an Apache build. But I have been 
> working on isolating the commit that may have caused the regression. I have 
> confirmed that it was caused by SPARK-5147 (PR # 
> [4149|https://github.com/apache/spark/pull/4149]). I confirmed this several 
> times using the test and the failure is consistently reproducible. 
> To re-confirm, I reverted just this one commit (and Clock consolidation one 
> to avoid conflicts), and the issue was no longer reproducible.
> Since this is a data loss issue, I believe this is a blocker for Spark 1.3.0
> /cc [~tdas], [~pwendell]






[jira] [Commented] (SPARK-5376) [Mesos] MesosExecutor should have correct resources

2015-03-12 Thread Lukasz Jastrzebski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359943#comment-14359943
 ] 

Lukasz Jastrzebski commented on SPARK-5376:
---

One comment: if you run multiple Spark applications, then even though 
executor-id == slave-id, multiple executors can be started on the same host, 
and every one of them will consume 1 CPU without scheduling any tasks. This can 
be painful when you want to run multiple streaming applications on Mesos in 
fine-grained mode, because each streaming driver's executors will consume 1 
CPU...

> [Mesos] MesosExecutor should have correct resources
> ---
>
> Key: SPARK-5376
> URL: https://issues.apache.org/jira/browse/SPARK-5376
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.2.0
>Reporter: Jongyoul Lee
>
> Spark offers task and executor resources. We should fix the resources assigned 
> to the executor; as is, it gets the same cores as the tasks and no memory.






[jira] [Created] (SPARK-6313) Fetch File Lock file creation doesnt work when Spark working dir is on a NFS mount

2015-03-12 Thread Nathan McCarthy (JIRA)
Nathan McCarthy created SPARK-6313:
--

 Summary: Fetch File Lock file creation doesnt work when Spark 
working dir is on a NFS mount
 Key: SPARK-6313
 URL: https://issues.apache.org/jira/browse/SPARK-6313
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Nathan McCarthy


When running in cluster mode with the Spark work dir mounted on an NFS volume 
(or any volume which doesn't support file locking), the fetchFile method in the 
Spark Utils class (used for downloading JARs etc. on the executors) will fail. 
This file locking was introduced as an improvement with SPARK-2713. 

See 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L415

Introduced in 1.2 in commit: 
https://github.com/apache/spark/commit/7aacb7bfad4ec73fd8f18555c72ef696 

As this locking is an optimisation for fetching files, could we take a 
different approach here and create a temp/advisory lock file? 

Typically you would just mount local disks (in, say, ext4 format) and provide 
them as a comma-separated list; however, we are trying to run Spark on MapR. 
With MapR we can do a loopback mount to a volume on the local node and take 
advantage of MapR's disk pools. This also means we don't need specific mounts 
for Spark, which improves the generic nature of the cluster. 
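
For context, a simplified sketch of the FileChannel-based locking pattern at 
the line referenced above (illustrative file names, not the exact 
Utils.fetchFile code); FileChannel.lock() is the call that can throw an 
IOException on mounts without file-locking support:

{code}
import java.io.{File, RandomAccessFile}

val localDir = new File(System.getProperty("java.io.tmpdir"))
val lockFile = new File(localDir, "example_fetch_lock")
val cachedFile = new File(localDir, "example_cached.jar")

val raf = new RandomAccessFile(lockFile, "rw")
val lock = raf.getChannel().lock() // fails on filesystems without lock support
try {
  if (!cachedFile.exists()) {
    // download the remote file into cachedFile here
  }
} finally {
  lock.release()
  raf.close()
}
{code}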






[jira] [Resolved] (SPARK-6311) ChiSqTest should check for too few counts

2015-03-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6311.

Resolution: Duplicate

> ChiSqTest should check for too few counts
> -
>
> Key: SPARK-6311
> URL: https://issues.apache.org/jira/browse/SPARK-6311
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> ChiSqTest assumes that elements of the contingency matrix are large enough 
> (have enough counts) s.t. the central limit theorem kicks in.  It would be 
> reasonable to do one or more of the following:
> * Add a note in the docs about making sure there are a reasonable number of 
> instances being used (or counts in the contingency table entries, to be more 
> precise and account for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is 
> insignificant






[jira] [Resolved] (SPARK-6310) ChiSqTest should check for too few counts

2015-03-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-6310.

Resolution: Duplicate

> ChiSqTest should check for too few counts
> -
>
> Key: SPARK-6310
> URL: https://issues.apache.org/jira/browse/SPARK-6310
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>
> ChiSqTest assumes that elements of the contingency matrix are large enough 
> (have enough counts) s.t. the central limit theorem kicks in.  It would be 
> reasonable to do one or more of the following:
> * Add a note in the docs about making sure there are a reasonable number of 
> instances being used (or counts in the contingency table entries, to be more 
> precise and account for skewed category distributions).
> * Add a check in the code which could:
> ** Log a warning message
> ** Alter the p-value to make sure it indicates the test result is 
> insignificant






[jira] [Commented] (SPARK-3066) Support recommendAll in matrix factorization model

2015-03-12 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359892#comment-14359892
 ] 

Debasish Das commented on SPARK-3066:
-

We use the non-level-3 BLAS code in our internal flows with ~60M x 3M 
datasets. Runtime is decent. I am moving to level-3 BLAS for 4823, and I think 
the speed will improve further.

> Support recommendAll in matrix factorization model
> --
>
> Key: SPARK-3066
> URL: https://issues.apache.org/jira/browse/SPARK-3066
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Debasish Das
>
> ALS returns a matrix factorization model, which we can use to predict ratings 
> for individual queries as well as small batches. In practice, users may want 
> to compute top-k recommendations offline for all users. It is very expensive 
> but a common problem. We can do some optimization like
> 1) collect one side (either user or product) and broadcast it as a matrix
> 2) use level-3 BLAS to compute inner products
> 3) use Utils.takeOrdered to find top-k
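
A hedged sketch of steps 1-3 (assumes model.userFeatures and 
model.productFeatures RDDs of (id, factor-array) pairs and a SparkContext sc; 
plain dot products and a sort stand in for level-3 BLAS and Utils.takeOrdered):

{code}
val productFactors: Array[(Int, Array[Double])] = model.productFeatures.collect()
val bcProducts = sc.broadcast(productFactors)       // step 1: broadcast one side
val k = 10
val topK = model.userFeatures.map { case (userId, userVec) =>
  val scored = bcProducts.value.map { case (productId, prodVec) =>
    // step 2: inner product of the user and product factor vectors
    (productId, userVec.zip(prodVec).map { case (u, p) => u * p }.sum)
  }
  (userId, scored.sortBy { case (_, score) => -score }.take(k)) // step 3: top-k
}
{code}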






[jira] [Commented] (SPARK-6299) ClassNotFoundException when running groupByKey with class defined in REPL.

2015-03-12 Thread Kevin (Sangwoo) Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359789#comment-14359789
 ] 

Kevin (Sangwoo) Kim commented on SPARK-6299:


Hi Sean, 
surely it should work; I guess this is a quite common pattern when working with 
the Spark shell. 
(This code works in Spark 1.1.1.) 


> ClassNotFoundException when running groupByKey with class defined in REPL.
> --
>
> Key: SPARK-6299
> URL: https://issues.apache.org/jira/browse/SPARK-6299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kevin (Sangwoo) Kim
>Priority: Critical
>
> Anyone can reproduce this issue by the code below
> (runs well in local mode, got exception with clusters)
> (it runs well in Spark 1.1.1)
> case class ClassA(value: String)
> val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) ))
> rdd.groupByKey.collect
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
> in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
> 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
> java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
> at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at sca

[jira] [Created] (SPARK-6312) ChiSqTest should check for too few counts

2015-03-12 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6312:


 Summary: ChiSqTest should check for too few counts
 Key: SPARK-6312
 URL: https://issues.apache.org/jira/browse/SPARK-6312
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley


ChiSqTest assumes that elements of the contingency matrix are large enough 
(have enough counts) s.t. the central limit theorem kicks in.  It would be 
reasonable to do one or more of the following:
* Add a note in the docs about making sure there are a reasonable number of 
instances being used (or counts in the contingency table entries, to be more 
precise and account for skewed category distributions).
* Add a check in the code which could:
** Log a warning message
** Alter the p-value to make sure it indicates the test result is insignificant
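
A minimal, self-contained sketch of such a check (hypothetical helper, not the 
actual ChiSqTest code), assuming the common rule of thumb that expected cell 
counts should be at least 5:

{code}
def checkExpectedCounts(expected: Array[Double], minCount: Double = 5.0): Unit = {
  val tooSmall = expected.count(_ < minCount)
  if (tooSmall > 0) {
    // In ChiSqTest this would go through the logger rather than println.
    println(s"WARNING: $tooSmall contingency-table cell(s) have expected counts " +
      s"below $minCount; the chi-squared approximation may be unreliable.")
  }
}
{code}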






[jira] [Created] (SPARK-6311) ChiSqTest should check for too few counts

2015-03-12 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6311:


 Summary: ChiSqTest should check for too few counts
 Key: SPARK-6311
 URL: https://issues.apache.org/jira/browse/SPARK-6311
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley


ChiSqTest assumes that elements of the contingency matrix are large enough 
(have enough counts) s.t. the central limit theorem kicks in.  It would be 
reasonable to do one or more of the following:
* Add a note in the docs about making sure there are a reasonable number of 
instances being used (or counts in the contingency table entries, to be more 
precise and account for skewed category distributions).
* Add a check in the code which could:
** Log a warning message
** Alter the p-value to make sure it indicates the test result is insignificant






[jira] [Created] (SPARK-6310) ChiSqTest should check for too few counts

2015-03-12 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6310:


 Summary: ChiSqTest should check for too few counts
 Key: SPARK-6310
 URL: https://issues.apache.org/jira/browse/SPARK-6310
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley


ChiSqTest assumes that elements of the contingency matrix are large enough 
(have enough counts) s.t. the central limit theorem kicks in.  It would be 
reasonable to do one or more of the following:
* Add a note in the docs about making sure there are a reasonable number of 
instances being used (or counts in the contingency table entries, to be more 
precise and account for skewed category distributions).
* Add a check in the code which could:
** Log a warning message
** Alter the p-value to make sure it indicates the test result is insignificant






[jira] [Created] (SPARK-6308) VectorUDT is displayed as `vecto` in dtypes

2015-03-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6308:


 Summary: VectorUDT is displayed as `vecto` in dtypes
 Key: SPARK-6308
 URL: https://issues.apache.org/jira/browse/SPARK-6308
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


VectorUDT should override simpleString instead of relying on the default 
implementation.
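
A hypothetical illustration (not the actual DataType code) of how a default 
name that strips a fixed-length "Type" suffix turns "VectorUDT" into "vecto", 
which is why an explicit override is needed:

{code}
def defaultSimpleString(className: String): String =
  className.stripSuffix("$").dropRight(4).toLowerCase

println(defaultSimpleString("BinaryType")) // "binary"
println(defaultSimpleString("VectorUDT"))  // "vecto"

// The fix would be for VectorUDT to return an explicit name instead, e.g.:
// override def simpleString: String = "vector"
{code}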






[jira] [Created] (SPARK-6309) Add MatrixUDT to support dense/sparse matrices in DataFrames

2015-03-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-6309:


 Summary: Add MatrixUDT to support dense/sparse matrices in 
DataFrames
 Key: SPARK-6309
 URL: https://issues.apache.org/jira/browse/SPARK-6309
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, SQL
Reporter: Xiangrui Meng


This should support both dense and sparse matrices, similar to VectorUDT.






[jira] [Commented] (SPARK-6210) Generated column name should not include id of column in it.

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359705#comment-14359705
 ] 

Apache Spark commented on SPARK-6210:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5006

> Generated column name should not include id of column in it.
> 
>
> Key: SPARK-6210
> URL: https://issues.apache.org/jira/browse/SPARK-6210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> df.groupBy().max('age').collect()
> [Row(MAX(age#0)=5)]
> >>> df3.groupBy().max('age', 'height').collect()
> [Row(MAX(age#4L)=5, MAX(height#5L)=85)]
> {code}






[jira] [Assigned] (SPARK-6210) Generated column name should not include id of column in it.

2015-03-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-6210:
-

Assignee: Davies Liu  (was: Michael Armbrust)

> Generated column name should not include id of column in it.
> 
>
> Key: SPARK-6210
> URL: https://issues.apache.org/jira/browse/SPARK-6210
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> df.groupBy().max('age').collect()
> [Row(MAX(age#0)=5)]
> >>> df3.groupBy().max('age', 'height').collect()
> [Row(MAX(age#4L)=5, MAX(height#5L)=85)]
> {code}






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359695#comment-14359695
 ] 

Apache Spark commented on SPARK-2426:
-

User 'debasish83' has created a pull request for this issue:
https://github.com/apache/spark/pull/5005

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2015-03-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359665#comment-14359665
 ] 

Nicholas Chammas commented on SPARK-5189:
-

For the record, this is the script I used to get the launch time stats above:

{code}
{
python -m timeit -r 6 -n 1 \
--setup 'import subprocess; import time; subprocess.call("yes y | 
./ec2/spark-ec2 destroy launch-test --identity-file /path/to/file.pem 
--key-pair my-pair --region us-east-1", shell=True); time.sleep(60)' \
'subprocess.call("./ec2/spark-ec2 launch launch-test --slaves 99 
--identity-file /path/to/file.pem --key-pair my-pair --region us-east-1 --zone 
us-east-1c --instance-type m3.large", shell=True)'

yes y | ./ec2/spark-ec2 destroy launch-test --identity-file 
/path/to/file.pem --key-pair my-pair --region us-east-1
}
{code}

> Reorganize EC2 scripts so that nodes can be provisioned independent of Spark 
> master
> ---
>
> Key: SPARK-5189
> URL: https://issues.apache.org/jira/browse/SPARK-5189
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>
> As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
> then setting up all the slaves together. This includes broadcasting files 
> from the lonely master to potentially hundreds of slaves.
> There are 2 main problems with this approach:
> # Broadcasting files from the master to all slaves using 
> [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
> (e.g. during [ephemeral-hdfs 
> init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
>  or during [Spark 
> setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
>  takes a long time. This time increases as the number of slaves increases.
>  I did some testing in {{us-east-1}}. This is, concretely, what the problem 
> looks like:
>  || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
>  Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, 
> but I think the point is clear enough.
> # It's more complicated to add slaves to an existing cluster (a la 
> [SPARK-2008]), since slaves are only configured through the master during the 
> setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or 
> a slave
> * Remove a node from a cluster
> We need our scripts to roughly be organized to match the above operations. 
> The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in 
> parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like 
> [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
> internals and perhaps even allow us to build [one tool that launches Spark 
> clusters on several different cloud 
> platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are:
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
> equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure 
> it fully creates a node that can be used as either a master or slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
> configures it as a master or slave, and joins it to a cluster.
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
> that script.






[jira] [Updated] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2015-03-12 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5189:

Description: 
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
 I did some testing in {{us-east-1}}. This is, concretely, what the problem 
looks like:
 || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
 Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but 
I think the point is clear enough.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
that script.

  was:
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned no

[jira] [Resolved] (SPARK-4588) Add API for feature attributes

2015-03-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4588.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4925
[https://github.com/apache/spark/pull/4925]

> Add API for feature attributes
> --
>
> Key: SPARK-4588
> URL: https://issues.apache.org/jira/browse/SPARK-4588
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Sean Owen
>Priority: Critical
> Fix For: 1.4.0
>
>
> Feature attributes, e.g., continuous/categorical, feature names, feature 
> dimension, number of categories, number of nonzeros (support) could be useful 
> for ML algorithms.
> In SPARK-3569, we added metadata to schema, which can be used to store 
> feature attributes along with the dataset. We need to provide a wrapper over 
> the Metadata class for ML usage.
> The design doc is available at 
> https://docs.google.com/document/d/1796XfSzFbZvGWFs0ky99AJhlqkOBRG1O2bUxK2N4Grk/edit?usp=sharing






[jira] [Commented] (SPARK-5389) spark-shell.cmd does not run from DOS Windows 7

2015-03-12 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359614#comment-14359614
 ] 

Yana Kadiyska commented on SPARK-5389:
--

{code}
C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where find
C:\Windows\System32\find.exe

C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>where findstr
C:\Windows\System32\findstr.exe

C:\Users\ykadiysk\Downloads\spark-1.2.0-bin-cdh4>echo %PATH%
C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files (x86)\Enterprise Vault\EVClient\;C:\Program Files (x86)\Git\cmd;C:\Program Files (x86)\Perforce;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files\Java\jdk1.7.0_40\bin;C:\Program Files (x86)\sbt\\bin;C:\Program Files (x86)\scala\bin;C:\apache-maven-3.1.0\bin;C:\Program Files\Java\jre7\bin\server;"c:\Program Files\R\R-3.0.2"\bin;C:\Python27
{code}

> spark-shell.cmd does not run from DOS Windows 7
> ---
>
> Key: SPARK-5389
> URL: https://issues.apache.org/jira/browse/SPARK-5389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.2.0
> Environment: Windows 7
>Reporter: Yana Kadiyska
> Attachments: SparkShell_Win7.JPG
>
>
> spark-shell.cmd crashes in DOS prompt Windows 7. Works fine under PowerShell. 
> spark-shell.cmd works fine for me in v.1.1 so this is new in spark1.2
> Marking as trivial since calling spark-shell2.cmd also works fine
> Attaching a screenshot since the error isn't very useful:
> {code}
> spark-1.2.0-bin-cdh4>bin\spark-shell.cmd
> else was unexpected at this time.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6294) PySpark task may hang while call take() on in Java/Scala

2015-03-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6294.
--
   Resolution: Fixed
Fix Version/s: (was: 1.3.1)
   (was: 1.4.0)
   1.2.2

Issue resolved by pull request 5003
[https://github.com/apache/spark/pull/5003]

> PySpark task may hang while call take() on in Java/Scala
> 
>
> Key: SPARK-6294
> URL: https://issues.apache.org/jira/browse/SPARK-6294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.2
>
>
> {code}
> >>> rdd = sc.parallelize(range(1<<20)).map(lambda x: str(x))
> >>> rdd._jrdd.first()
> {code}
> There is the stacktrace while hanging:
> {code}
> "Executor task launch worker-5" daemon prio=10 tid=0x7f8fd01a9800 
> nid=0x566 in Object.wait() [0x7f90481d7000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x000630929340> (a 
> org.apache.spark.api.python.PythonRDD$WriterThread)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x000630929340> (a 
> org.apache.spark.api.python.PythonRDD$WriterThread)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76)
>   at 
> org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:58)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6268) KMeans parameter getter methods

2015-03-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6268.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4974
[https://github.com/apache/spark/pull/4974]

> KMeans parameter getter methods
> ---
>
> Key: SPARK-6268
> URL: https://issues.apache.org/jira/browse/SPARK-6268
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 1.4.0
>
>
> KMeans has many setters for parameters.  It should have matching getters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks

2015-03-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359524#comment-14359524
 ] 

Reynold Xin commented on SPARK-6190:


If I can guarantee, at the block manager level, that all large blocks are chunked 
into smaller ones of less than 2 GB, then there is no reason to support 2 GB+ blocks 
at the block manager level. 

This affects the very core of Spark. It is important to think about how this 
will affect the long term Spark evolution (including explicit memory 
management, operating directly against records in the form of raw bytes, etc), 
rather than just rushing in, patching individual problems and leading to a 
codebase that has tons of random abstractions. 

On a separate topic: based on your design doc, LargeByteBuffer is still read-only. 
There is no interface for LargeByteBufferOutputStream to even write to a 
LargeByteBuffer. Can you include that? 
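
To make that request concrete, here is a minimal, hypothetical sketch (the names and shapes are mine, not the design doc's) of a write path that chunks data into sub-2GB {{ByteBuffer}}s and then exposes the chunks read-only:

{code}
// Hypothetical sketch, not the proposed API: an OutputStream that accumulates
// writes into <2GB ByteBuffer chunks, which a LargeByteBuffer could then wrap.
import java.io.OutputStream
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

class ChunkedLargeBufferOutputStream(chunkSize: Int = 64 * 1024 * 1024) extends OutputStream {
  private val chunks = ArrayBuffer[ByteBuffer](ByteBuffer.allocate(chunkSize))

  override def write(b: Int): Unit = {
    if (!chunks.last.hasRemaining) chunks += ByteBuffer.allocate(chunkSize)
    chunks.last.put(b.toByte)
  }

  /** Freeze the written data into read-only chunks (a stand-in for LargeByteBuffer). */
  def toChunks: Seq[ByteBuffer] = chunks.map { buf =>
    val dup = buf.duplicate()
    dup.flip()
    dup.asReadOnlyBuffer()
  }
}
{code}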


> create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
> --
>
> Key: SPARK-6190
> URL: https://issues.apache.org/jira/browse/SPARK-6190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Attachments: LargeByteBuffer.pdf
>
>
> A key component in eliminating the 2GB limit on blocks is creating a proper 
> abstraction for storing more than 2GB.  Currently spark is limited by a 
> reliance on nio ByteBuffer and netty ByteBuf, both of which are limited at 
> 2GB.  This task will introduce the new abstraction and the relevant 
> implementation and utilities, without effecting the existing implementation 
> at all.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5622) Add connector/handler hive configuration settings to hive-thrift-server

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5622.
--
Resolution: Won't Fix

This sounded more clearly like a WontFix from the PR.

> Add connector/handler hive configuration settings to hive-thrift-server
> ---
>
> Key: SPARK-5622
> URL: https://issues.apache.org/jira/browse/SPARK-5622
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0, 1.1.1
>Reporter: Alex Liu
>
> When integrate Cassandra Storage handler to Spark SQL, we need pass some 
> configuration settings to Hive-thrift-server hiveConf during server starting 
> process.
> e.g.
> {code}
> ./sbin/start-thriftserver.sh  --hiveconf cassandra.username=cassandra 
> --hiveconf cassandra.password=cassandra
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6307) Executors fetch the same rdd-block 100's or 1000's of times

2015-03-12 Thread Tobias Bertelsen (JIRA)
Tobias Bertelsen created SPARK-6307:
---

 Summary: Executors fetch the same rdd-block 100's or 1000's of times
 Key: SPARK-6307
 URL: https://issues.apache.org/jira/browse/SPARK-6307
 Project: Spark
  Issue Type: Bug
Affects Versions: 2+
 Environment: Linux, Spark Standalone 2.10, running in a PBS grid engine
Reporter: Tobias Bertelsen


The block manager kept fetching the same blocks over and over, making tasks 
with network activity extremely slow. Two identical tasks can take anywhere from 12 
seconds to more than an hour (which is where I stopped it).

Spark should cache the blocks so that it does not fetch the same blocks over, and 
over, and over.

Here is a simplified version of the code that provokes it:

{code}
// Read a few thousand lines (~ 15 MB)
val fileContents = sc.newAPIHadoopFile(path, ..).repartition(16)
val data = fileContents.map{x => parseContent(x)}.cache()
// Do a pairwise comparison and count the best pairs
val pairs = data.cartesian(data).filter { case (x, y) =>
  similarity(x, y) > 0.9
}
pairs.count()
{code}
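
One possible workaround, sketched here under the assumption that the parsed dataset (~15 MB) fits in memory on the driver and executors, is to broadcast one side of the comparison instead of using {{cartesian}}, so each executor reads the data exactly once:

{code}
// Workaround sketch (not a fix for the underlying issue): broadcast one side
// of the pairwise comparison so remote rdd blocks are not re-fetched per pair.
// Assumes `data` and `similarity` from the snippet above.
val broadcastData = sc.broadcast(data.collect())
val pairCount = data.map { x =>
  broadcastData.value.count(y => similarity(x, y) > 0.9).toLong
}.reduce(_ + _)
{code}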

This is a tiny fraction of one of the worker's stderr:

{code}
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_2 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_1 remotely
15/03/12 21:55:09 INFO BlockManager: Found block rdd_8_0 remotely

Thousands more lines, fetching the same 16 remote blocks

15/03/12 22:25:44 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
15/03/12 22:25:45 INFO BlockManager: Found block rdd_8_0 remotely
{code}

h2. Details for that stage from the UI.

 - *Total task time across all tasks:* 11.9 h
 - *Input:* 2.2 GB
 - *Shuffle read:* 4.5 MB


h3. Summary Metrics for 176 Completed Tasks

|| Metric || Min || 25th percentile || Median || 75th percentile || Max ||
| Duration | 7 s | 8 s | 8 s | 12 s | 59 min |
| GC Time | 0 ms | 99 ms | 0.1 s | 0.2 s | 0.5 s |
| Input | 6.9 MB | 8.2 MB | 8.4 MB | 9.0 MB | 11.0 MB |
| Shuffle Read (Remote) | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 676.6 KB |



h3. Aggregated Metrics by Executor

|| Executor ID || Address || Task Time || Total Tasks || Failed Tasks || Succeeded Tasks || Input || Output || Shuffle Read || Shuffle Write || Shuffle Spill (Memory) || Shuffle Spill (Disk) ||
| 0 | n-62-23-3:49566 | 5.7 h | 9 | 0 | 9 | 171.0 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 1 | n-62-23-6:57518 | 16.4 h | 20 | 0 | 20 | 169.9 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 2 | n-62-18-48:33551 | 0 ms | 0 | 0 | 0 | 169.6 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |
| 3 | n-62-23-5:58421 | 2.9 min | 12 | 0 | 12 | 266.2 MB | 0.0 B | 4.5 MB | 0.0 B | 0.0 B | 0.0 B |
| 4 | n-62-23-1:40096 | 23 min | 164 | 0 | 164 | 1430.4 MB | 0.0 B | 0.0 B | 0.0 B | 0.0 B | 0.0 B |




h3. Tasks

|| Index || ID || Attempt || Status || Locality Level || Executor ID / Host || Launch Time || Duration || GC Time || Input || Shuffle Read || Errors ||
| 1 | 2 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 6.9 MB (memory) | 676.6 KB | |
| 0 | 1 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.7 MB (network) | 0.0 B | |
| 4 | 5 | 0 | SUCCESS | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 38 min | 0.4 s | 8.6 MB (network) | 0.0 B | |
| 3 | 4 | 0 | RUNNING | ANY | 2 / n-62-18-48 | 2015/03/12 21:55:00 | 55 min | | 8.3 MB (network) | 0.0 B | |
| 2 | 3 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 11 s | 0.3 s | 8.4 MB (memory) | 0.0 B | |
| 7 | 8 | 0 | SUCCESS | ANY | 4 / n-62-23-1 | 2015/03/12 21:55:00 | 12 s | 0.3 s | 9.2 MB (memory) | 0.0 B | |
| 6 | 7 | 0 | SUCCESS | ANY | 3 / n-62-23-5 | 2015/03/12 21:55:00 | 12 s | 0.1 s | 8.1 MB (memory) | 0.0 B | |
| 5 | 6 | 0 | SUCCESS | ANY | 0 / n-62-23-3 | 2015/03/12 21:55:00 | 39 min | 0.3 s | 8.6 MB (network) | 0.0 B | |
| 9 | 10 | 0 | RUNNING | ANY | 1 / n-62-23-6 | 2015/03/12 21:55:00 | 55 min | | 8.7 MB (network) | 0.0 B | |










--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5740) Change comment default value from empty string to "null" in DescribeCommand

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5740:
-
Priority: Minor  (was: Major)
Target Version/s:   (was: 1.4.0)
   Fix Version/s: (was: 1.3.0)

Given the PR discussion, is this WontFix? i wasn't 100% sure.

> Change comment default value from empty string to "null" in DescribeCommand
> ---
>
> Key: SPARK-5740
> URL: https://issues.apache.org/jira/browse/SPARK-5740
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Li Sheng
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Change comment default value from empty string to "null" in DescribeCommand



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359437#comment-14359437
 ] 

Sean Owen commented on SPARK-4927:
--

OK, behavior looks a little different on YARN. I find memory usage, however, 
stabilizes quickly. For example with 2 executors / 512M / 1 core each, they 
show 461 and 467 MB free, +/- 1MB. With 5 executors, 8 cores, 512MB, about 
500MB is free very consistently over time.

Maybe it was resolved at some point?

> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleanup.tll parameter allows a user to configure when cleanup 
> happens but the issue with doing this is that it isn’t done safely, e.g. If 
> this clears a cached RDD or active task in the middle of processing a stage, 
> this ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359404#comment-14359404
 ] 

Nicholas Chammas commented on SPARK-6282:
-

Shouldn't be related to boto. "_winreg" appears to be something Python uses to 
access the Windows registry, which is strange.

Please give us more details about your cluster setup, where you are running the 
driver from, etc. Also, what if you try using numpy's implementation of 
{{random}}?

> Strange Python import error when using random() in a lambda function
> 
>
> Key: SPARK-6282
> URL: https://issues.apache.org/jira/browse/SPARK-6282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, Python 2.7.6
>Reporter: Pavel Laskov
>Priority: Minor
>
> Consider the exemplary Python code below:
>from random import random
>from pyspark.context import SparkContext
>from xval_mllib import read_csv_file_as_list
> if __name__ == "__main__": 
> sc = SparkContext(appName="Random() bug test")
> data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
> #data = sc.parallelize([1, 2, 3, 4, 5], 2)
> d = data.map(lambda x: (random(), x))
> print d.first()
> Data is read from a large CSV file. Running this code results in a Python 
> import error:
> ImportError: No module named _winreg
> If I use 'import random' and 'random.random()' in the lambda function no 
> error occurs. Also no error occurs, for both kinds of import statements, for 
> a small artificial data set like the one shown in a commented line.  
> The full error trace, the source code of csv reading code (function 
> 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
> dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1673) GLMNET implementation in Spark

2015-03-12 Thread mike bowles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359366#comment-14359366
 ] 

mike bowles commented on SPARK-1673:


Here's a table of scaling results for our implementation of glmnet regression. 
These were run locally on a 4-core server. The data set is the Higgs boson data 
set (available on AWS). We measured training times for various numbers of rows 
of data, from 1000 to 10 million. The attribute space is 28 variables wide. We 
ran on 1 through 4 cores on the server.

Training times (sec)
|| #rows || 1-core || 2-core || 3-core || 4-core ||
| 100K | 4.88 | 3.79 | 3.41 | 3.48 |
| 1M | 20.5 | 10.6 | 9.51 | 8.45 |
| 5M | 71.2 | 37.1 | 26.7 | 25.5 |
| 10M | 155 | 70.5 | 59.7 | 49.7 |
The structure of the algorithm suggests that training time should be linear in 
the number of rows, and the test results bear that out. Two cores show a speedup 
of ~2 over one core, three cores ~2.6, and four cores ~3.11. The four-core result 
probably lags because of contention with system processes, etc. Running on AWS 
will make that clearer; that is in process now.

Our next steps are:
1. run on some wider data sets
2. run on a larger cluster
3. run OWLQN on the same problems in the same setting
4. experiment with speedups - Joseph Bradley's approximation idea, and cutting 
the number of data passes down by predicting which variables are going to become 
active instead of waiting until they do.
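
For readers who want a concrete picture of the inner loop being timed above, here is a minimal, standalone sketch (assuming standardized features; this is not the benchmarked implementation) of the single-coordinate soft-thresholding update that coordinate-descent elastic net performs:

{code}
// Soft-thresholding operator S(z, gamma) = sign(z) * max(|z| - gamma, 0)
def softThreshold(z: Double, gamma: Double): Double =
  math.signum(z) * math.max(math.abs(z) - gamma, 0.0)

// One coordinate update for elastic net with standardized columns of x:
// beta_j <- S(rho_j, lambda * alpha) / (1 + lambda * (1 - alpha))
def updateCoordinate(
    x: Array[Array[Double]],  // n rows by p columns
    y: Array[Double],
    beta: Array[Double],
    j: Int,
    lambda: Double,
    alpha: Double): Double = {
  val n = y.length
  var rho = 0.0
  var i = 0
  while (i < n) {
    // partial residual for row i, excluding feature j
    var pred = 0.0
    var k = 0
    while (k < beta.length) {
      if (k != j) pred += x(i)(k) * beta(k)
      k += 1
    }
    rho += x(i)(j) * (y(i) - pred)
    i += 1
  }
  rho /= n
  softThreshold(rho, lambda * alpha) / (1.0 + lambda * (1.0 - alpha))
}
{code}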

> GLMNET implementation in Spark
> --
>
> Key: SPARK-1673
> URL: https://issues.apache.org/jira/browse/SPARK-1673
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sung Chung
>
> This is a Spark implementation of GLMNET by Jerome Friedman, Trevor Hastie, 
> Rob Tibshirani.
> http://www.jstatsoft.org/v33/i01/paper
> It's a straightforward implementation of the Coordinate-Descent based L1/L2 
> regularized linear models, including Linear/Logistic/Multinomial regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359362#comment-14359362
 ] 

Sean Owen commented on SPARK-6282:
--

[~nchammas] or [~shivaram] might have a clue if it distantly relates to boto.

> Strange Python import error when using random() in a lambda function
> 
>
> Key: SPARK-6282
> URL: https://issues.apache.org/jira/browse/SPARK-6282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, Python 2.7.6
>Reporter: Pavel Laskov
>Priority: Minor
>
> Consider the exemplary Python code below:
>from random import random
>from pyspark.context import SparkContext
>from xval_mllib import read_csv_file_as_list
> if __name__ == "__main__": 
> sc = SparkContext(appName="Random() bug test")
> data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
> #data = sc.parallelize([1, 2, 3, 4, 5], 2)
> d = data.map(lambda x: (random(), x))
> print d.first()
> Data is read from a large CSV file. Running this code results in a Python 
> import error:
> ImportError: No module named _winreg
> If I use 'import random' and 'random.random()' in the lambda function no 
> error occurs. Also no error occurs, for both kinds of import statements, for 
> a small artificial data set like the one shown in a commented line.  
> The full error trace, the source code of csv reading code (function 
> 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
> dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359336#comment-14359336
 ] 

Joseph K. Bradley commented on SPARK-6282:
--

It looks like "winreg" is referenced in Spark's dependencies (specifically, 
"boto" which is used for ec2).  I'm not very familiar with that part, and it's 
strange to me that it's ML-specific.  If others here aren't sure, I'd try 
asking on the user list.

> Strange Python import error when using random() in a lambda function
> 
>
> Key: SPARK-6282
> URL: https://issues.apache.org/jira/browse/SPARK-6282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, Python 2.7.6
>Reporter: Pavel Laskov
>Priority: Minor
>
> Consider the exemplary Python code below:
>from random import random
>from pyspark.context import SparkContext
>from xval_mllib import read_csv_file_as_list
> if __name__ == "__main__": 
> sc = SparkContext(appName="Random() bug test")
> data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
> #data = sc.parallelize([1, 2, 3, 4, 5], 2)
> d = data.map(lambda x: (random(), x))
> print d.first()
> Data is read from a large CSV file. Running this code results in a Python 
> import error:
> ImportError: No module named _winreg
> If I use 'import random' and 'random.random()' in the lambda function no 
> error occurs. Also no error occurs, for both kinds of import statements, for 
> a small artificial data set like the one shown in a commented line.  
> The full error trace, the source code of csv reading code (function 
> 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
> dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-03-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359294#comment-14359294
 ] 

Ilya Ganelin commented on SPARK-4927:
-

Are you running over yarn? My theory is that the memory usage has to do with 
data movement between nodes.



> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleanup.tll parameter allows a user to configure when cleanup 
> happens but the issue with doing this is that it isn’t done safely, e.g. If 
> this clears a cached RDD or active task in the middle of processing a stage, 
> this ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3424) KMeans Plus Plus is too slow

2015-03-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359241#comment-14359241
 ] 

Xiangrui Meng commented on SPARK-3424:
--

Ah, sorry! I typed your email manually in the commit message but I missed "r". 
The commit message is immutable, so I cannot update it now. I'll be more 
careful next time.

> KMeans Plus Plus is too slow
> 
>
> Key: SPARK-3424
> URL: https://issues.apache.org/jira/browse/SPARK-3424
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Derrick Burns
>Assignee: Derrick Burns
> Fix For: 1.3.0
>
>
> The  KMeansPlusPlus algorithm is implemented in time O( m k^2), where m is 
> the rounds of the KMeansParallel algorithm and k is the number of clusters.  
> This can be dramatically improved by maintaining the distance the closest 
> cluster center from round to round and then incrementally updating that value 
> for each point. This incremental update is O(1) time, this reduces the 
> running time for K Means Plus Plus to O( m k ).  For large k, this is 
> significant.
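
For illustration only (a standalone sketch, not the MLlib code), the O(1)-per-point bookkeeping described above amounts to keeping each point's squared distance to its closest chosen center and refreshing it whenever a new center is added:

{code}
// Standalone sketch of the incremental update: after picking a new center,
// each point's cached closest-distance only needs a single comparison.
def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
  s
}

def updateClosest(points: Array[Array[Double]],
                  minDistSq: Array[Double],
                  newCenter: Array[Double]): Unit = {
  var i = 0
  while (i < points.length) {
    val d = squaredDistance(points(i), newCenter)
    if (d < minDistSq(i)) minDistSq(i) = d
    i += 1
  }
}
{code}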



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4001) Add FP-growth algorithm to Spark MLlib

2015-03-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4001:
-
Summary: Add FP-growth algorithm to Spark MLlib  (was: Add Apriori 
algorithm to Spark MLlib)

> Add FP-growth algorithm to Spark MLlib
> --
>
> Key: SPARK-4001
> URL: https://issues.apache.org/jira/browse/SPARK-4001
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jacky Li
>Assignee: Jacky Li
> Fix For: 1.3.0
>
> Attachments: Distributed frequent item mining algorithm based on 
> Spark.pptx
>
>
> Apriori is the classic algorithm for frequent item set mining in a 
> transactional data set.  It will be useful if Apriori algorithm is added to 
> MLLib in Spark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359200#comment-14359200
 ] 

Sean Owen commented on SPARK-4927:
--

Yes, that's what I'm running in spark-shell (plus imports, and minus that log line 
that didn't compile for some reason). I don't see available memory decreasing.

> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleanup.tll parameter allows a user to configure when cleanup 
> happens but the issue with doing this is that it isn’t done safely, e.g. If 
> this clears a cached RDD or active task in the middle of processing a stage, 
> this ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-03-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359166#comment-14359166
 ] 

Patrick Wendell commented on SPARK-5654:


I see the decision here as somewhat orthogonal to vendors and vendor packaging. 
Vendors can choose whether to package this component or not, and some may leave 
it out until it gets more mature. Of course, they are more encouraged/pressured 
to package things that end up inside the project itself, but that could be used 
to justify merging all kinds of random stuff into Spark, so I don't think it's 
a sufficient justification.

The main argument, as I said before, is just that non-JVM language APIs are 
really not possible to maintain outside of the project, because they are not 
building on any even remotely "public" API. Imagine if we tried to have PySpark 
as its own project; it is so tightly coupled that it wouldn't work.

I have argued in the past for things to exist outside the project when they 
can, and I still promote that strongly.

> Integrate SparkR into Apache Spark
> --
>
> Key: SPARK-5654
> URL: https://issues.apache.org/jira/browse/SPARK-5654
> Project: Spark
>  Issue Type: New Feature
>  Components: Project Infra
>Reporter: Shivaram Venkataraman
>
> The SparkR project [1] provides a light-weight frontend to launch Spark jobs 
> from R. The project was started at the AMPLab around a year ago and has been 
> incubated as its own project to make sure it can be easily merged into 
> upstream Spark, i.e. not introduce any external dependencies etc. SparkR’s 
> goals are similar to PySpark and shares a similar design pattern as described 
> in our meetup talk[2], Spark Summit presentation[3].
> Integrating SparkR into the Apache project will enable R users to use Spark 
> out of the box and given R’s large user base, it will help the Spark project 
> reach more users.  Additionally, work in progress features like providing R 
> integration with ML Pipelines and Dataframes can be better achieved by 
> development in a unified code base.
> SparkR is available under the Apache 2.0 License and does not have any 
> external dependencies other than requiring users to have R and Java installed 
> on their machines.  SparkR’s developers come from many organizations 
> including UC Berkeley, Alteryx, Intel and we will support future development, 
> maintenance after the integration.
> [1] https://github.com/amplab-extras/SparkR-pkg
> [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
> [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-03-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358958#comment-14358958
 ] 

Ilya Ganelin edited comment on SPARK-4927 at 3/12/15 6:50 PM:
--

Hi Sean - I have a code snippet that reproduced this. Let me send it to you in 
a bit - I don't have the means to run 1.3 in a cluster.

Realized that I already had that code snippet posted. Running the above code 
doesn't reproduce the issue?



was (Author: ilganeli):
Hi Sean - I have a code snippet that reproduced this. Let me send it to you in 
a bit - I don't have the means to run 1.3 in a cluster.



Sent with Good (www.good.com)




> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleanup.tll parameter allows a user to configure when cleanup 
> happens but the issue with doing this is that it isn’t done safely, e.g. If 
> this clears a cached RDD or active task in the middle of processing a stage, 
> this ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4012) Uncaught OOM in ContextCleaner

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359062#comment-14359062
 ] 

Apache Spark commented on SPARK-4012:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/5004

> Uncaught OOM in ContextCleaner
> --
>
> Key: SPARK-4012
> URL: https://issues.apache.org/jira/browse/SPARK-4012
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>
> When running a "might-be-memory-intensive" application locally, I received 
> the following exception:
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Spark Context Cleaner"
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the 
> VM may need to be forcibly terminated
> Exception: java.lang.OutOfMemoryError thrown from the 
> UncaughtExceptionHandler in thread "Driver Heartbeater"
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- 
> the VM may need to be forcibly terminated
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the 
> VM may need to be forcibly terminated
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the 
> VM may need to be forcibly terminated
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the 
> VM may need to be forcibly terminated
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the 
> VM may need to be forcibly terminated
> Java HotSpot(TM) 64-Bit Server VM warning: Exception 
> java.lang.OutOfMemoryError occurred dispatching signal SIGINT to handler- the 
> VM may need to be forcibly terminated
> I looked at the code, we might want to call Utils.tryOrExit instead of 
> Utils.logUncaughtExceptions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359027#comment-14359027
 ] 

Sean Owen commented on SPARK-1564:
--

Yeah that's what I did, just made it not tied to the old 1.0 parent issue.

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-03-12 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359017#comment-14359017
 ] 

Matei Zaharia commented on SPARK-1564:
--

This is still a valid issue AFAIK, isn't it? These things still show up badly 
in Javadoc. So we could change the parent issue or something but I'd like to 
see it fixed.

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6294) PySpark task may hang while call take() on in Java/Scala

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359016#comment-14359016
 ] 

Apache Spark commented on SPARK-6294:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5003

> PySpark task may hang while call take() on in Java/Scala
> 
>
> Key: SPARK-6294
> URL: https://issues.apache.org/jira/browse/SPARK-6294
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.0, 1.3.1
>
>
> {code}
> >>> rdd = sc.parallelize(range(1<<20)).map(lambda x: str(x))
> >>> rdd._jrdd.first()
> {code}
> There is the stacktrace while hanging:
> {code}
> "Executor task launch worker-5" daemon prio=10 tid=0x7f8fd01a9800 
> nid=0x566 in Object.wait() [0x7f90481d7000]
>java.lang.Thread.State: WAITING (on object monitor)
>   at java.lang.Object.wait(Native Method)
>   - waiting on <0x000630929340> (a 
> org.apache.spark.api.python.PythonRDD$WriterThread)
>   at java.lang.Thread.join(Thread.java:1281)
>   - locked <0x000630929340> (a 
> org.apache.spark.api.python.PythonRDD$WriterThread)
>   at java.lang.Thread.join(Thread.java:1355)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:78)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$compute$1.apply(PythonRDD.scala:76)
>   at 
> org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
>   at 
> org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:58)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5310) Update SQL programming guide for 1.3

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358981#comment-14358981
 ] 

Apache Spark commented on SPARK-5310:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/5001

> Update SQL programming guide for 1.3
> 
>
> Key: SPARK-5310
> URL: https://issues.apache.org/jira/browse/SPARK-5310
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> We make quite a few changes. We should update the SQL programming guide to 
> reflect these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358964#comment-14358964
 ] 

Apache Spark commented on SPARK-6286:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/5000

> Handle TASK_ERROR in TaskState
> --
>
> Key: SPARK-6286
> URL: https://issues.apache.org/jira/browse/SPARK-6286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Iulian Dragos
>Priority: Minor
>  Labels: mesos
>
> Scala warning:
> {code}
> match may not be exhaustive. It would fail on the following input: TASK_ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-03-12 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358958#comment-14358958
 ] 

Ilya Ganelin commented on SPARK-4927:
-

Hi Sean - I have a code snippet that reproduced this. Let me send it to you in 
a bit - I don't have the means to run 1.3 in a cluster.



> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleanup.tll parameter allows a user to configure when cleanup 
> happens but the issue with doing this is that it isn’t done safely, e.g. If 
> this clears a cached RDD or active task in the middle of processing a stage, 
> this ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4927) Spark does not clean up properly during long jobs.

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358933#comment-14358933
 ] 

Sean Owen commented on SPARK-4927:
--

I'm interested in this one. When I run it, though, it holds steady:

15/03/12 16:37:17 INFO MemoryStore: Block broadcast_29480_piece0 stored as 
bytes in memory (estimated size 1395.0 B, free 133.2 MB)

It's always 133.1 or 133.2 MB for me. I wonder if you can still reproduce this 
on 1.3?

> Spark does not clean up properly during long jobs. 
> ---
>
> Key: SPARK-4927
> URL: https://issues.apache.org/jira/browse/SPARK-4927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Ilya Ganelin
>
> On a long running Spark job, Spark will eventually run out of memory on the 
> driver node due to metadata overhead from the shuffle operation. Spark will 
> continue to operate, however with drastically decreased performance (since 
> swapping now occurs with every operation).
> The spark.cleanup.tll parameter allows a user to configure when cleanup 
> happens but the issue with doing this is that it isn’t done safely, e.g. If 
> this clears a cached RDD or active task in the middle of processing a stage, 
> this ultimately causes a KeyNotFoundException when the next stage attempts to 
> reference the cleared RDD or task.
> There should be a sustainable mechanism for cleaning up stale metadata that 
> allows the program to continue running. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-12 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358905#comment-14358905
 ] 

Iulian Dragos commented on SPARK-6286:
--

Sure, I'll issue a PR for handling {{TASK_ERROR => TASK_LOST}}

> Handle TASK_ERROR in TaskState
> --
>
> Key: SPARK-6286
> URL: https://issues.apache.org/jira/browse/SPARK-6286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Iulian Dragos
>Priority: Minor
>  Labels: mesos
>
> Scala warning:
> {code}
> match may not be exhaustive. It would fail on the following input: TASK_ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1546) Add AdaBoost algorithm to Spark MLlib

2015-03-12 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358881#comment-14358881
 ] 

Manish Amde commented on SPARK-1546:


I haven't worked on it since we haven't heard a need for it post RF and GBT 
work. :-) 

This might be best done after the API standardization work on 
https://issues.apache.org/jira/browse/SPARK-6113

> Add AdaBoost algorithm to Spark MLlib
> -
>
> Key: SPARK-1546
> URL: https://issues.apache.org/jira/browse/SPARK-1546
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Manish Amde
>Assignee: Manish Amde
>
> This task requires adding the AdaBoost algorithm to Spark MLlib. The 
> implementation needs to adapt the classic AdaBoost algorithm to the scalable 
> tree implementation.
> The tasks involves:
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6300) sc.addFile(path) does not support the relative path.

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358877#comment-14358877
 ] 

Sean Owen commented on SPARK-6300:
--

(Sandy notes it's a regression so yeah it's more important. I didn't think this 
was ever supposed to work)
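
In the meantime, a caller-side workaround (a sketch, assuming the file is local to the driver) is to resolve the relative path before handing it to {{addFile}}:

{code}
// Workaround sketch: resolve the relative path on the driver first.
import java.io.File
val absolutePath = new File("../test.txt").getCanonicalPath
sc.addFile(absolutePath)
{code}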

> sc.addFile(path) does not support the relative path.
> 
>
> Key: SPARK-6300
> URL: https://issues.apache.org/jira/browse/SPARK-6300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.2.1
>Reporter: DoingDone9
>Assignee: DoingDone9
>Priority: Critical
>
> When I run a command like sc.addFile("../test.txt"), it does not work and throws 
> an exception:
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: file:../test.txt
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.(Path.java:172) 
> 
> ...
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> file:../test.txt
> at java.net.URI.checkPath(URI.java:1804)
> at java.net.URI.(URI.java:752)
> at org.apache.hadoop.fs.Path.initialize(Path.java:203)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6299) ClassNotFoundException when running groupByKey with class defined in REPL.

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6299:
-
Component/s: Spark Shell

> ClassNotFoundException when running groupByKey with class defined in REPL.
> --
>
> Key: SPARK-6299
> URL: https://issues.apache.org/jira/browse/SPARK-6299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kevin (Sangwoo) Kim
>Priority: Critical
>
> Anyone can reproduce this issue with the code below
> (it runs fine in local mode but throws an exception on clusters;
> it also runs fine in Spark 1.1.1):
> case class ClassA(value: String)
> val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) ))
> rdd.groupByKey.collect
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
> in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
> 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
> java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
> at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGSched

[jira] [Updated] (SPARK-6300) sc.addFile(path) does not support the relative path.

2015-03-12 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-6300:
--
Priority: Critical  (was: Minor)
Target Version/s: 1.3.1

> sc.addFile(path) does not support the relative path.
> 
>
> Key: SPARK-6300
> URL: https://issues.apache.org/jira/browse/SPARK-6300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.2.1
>Reporter: DoingDone9
>Assignee: DoingDone9
>Priority: Critical
>
> When I run a command like sc.addFile("../test.txt"), it does not work and throws 
> an exception:
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: file:../test.txt
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.(Path.java:172) 
> 
> ...
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> file:../test.txt
> at java.net.URI.checkPath(URI.java:1804)
> at java.net.URI.(URI.java:752)
> at org.apache.hadoop.fs.Path.initialize(Path.java:203)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6273) Got error when one table's alias name is the same with other table's column name

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6273:
-
Component/s: SQL
Description: 
while one table's alias name is the same with other table's column name
get the error Ambiguous references

{code}
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Ambiguous references to salary.pay_date: 
(pay_date#34749,List()),(salary#34792,List(pay_date)), tree:
'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year 
= 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && 
('employee.store_id = 'store.store_id))
 Join Inner, None
  Join Inner, None
   Join Inner, None
MetastoreRelation yxqtest, time_by_day, Some(time_by_day)
MetastoreRelation yxqtest, salary, Some(salary)
   MetastoreRelation yxqtest, store, Some(store)
  MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0)
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Ambiguous references to salary.pay_date: 
(pay_date#34749,List()),(salary#34792,List(pay_date)), tree:
'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year 
= 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && 
('employee.store_id = 'store.store_id))
 Join Inner, None
  Join Inner, None
   Join Inner, None
MetastoreRelation yxqtest, time_by_day, Some(time_by_day)
MetastoreRelation yxqtest, salary, Some(salary)
   MetastoreRelation yxqtest, store, Some(store)
  MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0)
{code}


  was:
while one table's alias name is the same with other table's column name
get the error Ambiguous references

Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Ambiguous references to salary.pay_date: 
(pay_date#34749,List()),(salary#34792,List(pay_date)), tree:
'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year 
= 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && 
('employee.store_id = 'store.store_id))
 Join Inner, None
  Join Inner, None
   Join Inner, None
MetastoreRelation yxqtest, time_by_day, Some(time_by_day)
MetastoreRelation yxqtest, salary, Some(salary)
   MetastoreRelation yxqtest, store, Some(store)
  MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0)
Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
Ambiguous references to salary.pay_date: 
(pay_date#34749,List()),(salary#34792,List(pay_date)), tree:
'Filter 'salary.pay_date = 'time_by_day.the_date) && ('time_by_day.the_year 
= 1997.0)) && ('salary.employee_id = 'employee.employee_id)) && 
('employee.store_id = 'store.store_id))
 Join Inner, None
  Join Inner, None
   Join Inner, None
MetastoreRelation yxqtest, time_by_day, Some(time_by_day)
MetastoreRelation yxqtest, salary, Some(salary)
   MetastoreRelation yxqtest, store, Some(store)
  MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0)

Summary: Got error when one table's alias name is the same with other 
table's column name  (was: Got error when do join)

(Make the title more descriptive and add a component)

> Got error when one table's alias name is the same with other table's column 
> name
> 
>
> Key: SPARK-6273
> URL: https://issues.apache.org/jira/browse/SPARK-6273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: Jeff
>
> When one table's alias name is the same as another table's column name,
> we get the error "Ambiguous references":
> {code}
> Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Ambiguous references to salary.pay_date: 
> (pay_date#34749,List()),(salary#34792,List(pay_date)), tree:
> 'Filter 'salary.pay_date = 'time_by_day.the_date) && 
> ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = 
> 'employee.employee_id)) && ('employee.store_id = 'store.store_id))
>  Join Inner, None
>   Join Inner, None
>Join Inner, None
> MetastoreRelation yxqtest, time_by_day, Some(time_by_day)
> MetastoreRelation yxqtest, salary, Some(salary)
>MetastoreRelation yxqtest, store, Some(store)
>   MetastoreRelation yxqtest, employee, Some(employee) (state=,code=0)
> Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Ambiguous references to salary.pay_date: 
> (pay_date#34749,List()),(salary#34792,List(pay_date)), tree:
> 'Filter 'salary.pay_date = 'time_by_day.the_date) && 
> ('time_by_day.the_year = 1997.0)) && ('salary.employee_id = 
> 'employee.employee_id)) && ('employee.store_id = 'store.store_id))
>  Join Inner, None
>   Join Inner, None
>Join Inner, None
> MetastoreRelation yxqtest, time_by_day, Some(time_by_day)
> MetastoreRelation yx

[jira] [Commented] (SPARK-1548) Add Partial Random Forest algorithm to MLlib

2015-03-12 Thread Manish Amde (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358871#comment-14358871
 ] 

Manish Amde commented on SPARK-1548:


We should also leave this ticket unassigned for somebody else to pick up 
if/when interested.

> Add Partial Random Forest algorithm to MLlib
> 
>
> Key: SPARK-1548
> URL: https://issues.apache.org/jira/browse/SPARK-1548
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>Assignee: Frank Dai
>
> This task involves creating an alternate approximate random forest 
> implementation where each tree is constructed per partition.
> The task involves:
> - Justifying with theory and experimental results why this algorithm is a 
> good choice.
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1548) Add Partial Random Forest algorithm to MLlib

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1548:
-
Assignee: (was: Frank Dai)

> Add Partial Random Forest algorithm to MLlib
> 
>
> Key: SPARK-1548
> URL: https://issues.apache.org/jira/browse/SPARK-1548
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Manish Amde
>
> This task involves creating an alternate approximate random forest 
> implementation where each tree is constructed per partition.
> The task involves:
> - Justifying with theory and experimental results why this algorithm is a 
> good choice.
> - Comparing the various tradeoffs and finalizing the algorithm before 
> implementation
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6306) Readme points to dead link

2015-03-12 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358850#comment-14358850
 ] 

Theodore Vasiloudis commented on SPARK-6306:


I'll keep that in mind in the future.

> Readme points to dead link
> --
>
> Key: SPARK-6306
> URL: https://issues.apache.org/jira/browse/SPARK-6306
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Theodore Vasiloudis
>Priority: Trivial
> Fix For: 1.4.0
>
>
> The link to "Specifying the Hadoop Version" now points to 
> http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version.
> The correct link is: 
> http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6300) sc.addFile(path) does not support the relative path.

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6300:
-
Priority: Minor  (was: Critical)
Target Version/s:   (was: 1.3.1)

> sc.addFile(path) does not support the relative path.
> 
>
> Key: SPARK-6300
> URL: https://issues.apache.org/jira/browse/SPARK-6300
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.0, 1.2.1
>Reporter: DoingDone9
>Assignee: DoingDone9
>Priority: Minor
>
> When I run a command like sc.addFile("../test.txt"), it does not work and throws 
> an exception:
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: file:../test.txt
> at org.apache.hadoop.fs.Path.initialize(Path.java:206)
> at org.apache.hadoop.fs.Path.(Path.java:172) 
> 
> ...
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> file:../test.txt
> at java.net.URI.checkPath(URI.java:1804)
> at java.net.URI.(URI.java:752)
> at org.apache.hadoop.fs.Path.initialize(Path.java:203)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6299) ClassNotFoundException when running groupByKey with class defined in REPL.

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358816#comment-14358816
 ] 

Sean Owen commented on SPARK-6299:
--

Hm, is this supposed to work? the class is not defined outside your driver 
process, and isn't found on the executors as a result.
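
One way to sidestep this while the regression is investigated (a hedged sketch; the file, jar, and package names are illustrative, not from the report) is to compile the case class into a small jar and ship it to the executors instead of defining it in the REPL:

{code}
// ClassA.scala -- compiled separately (e.g. into classa.jar) and passed to spark-shell
// via --jars, so both the driver and the executors can load it by name.
package example

case class ClassA(value: String)
{code}

Starting the shell with {{spark-shell --jars classa.jar}} and then {{import example.ClassA}} should let the original groupByKey example run, since the class no longer lives only in the REPL's generated wrappers.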

> ClassNotFoundException when running groupByKey with class defined in REPL.
> --
>
> Key: SPARK-6299
> URL: https://issues.apache.org/jira/browse/SPARK-6299
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.3.0, 1.2.1
>Reporter: Kevin (Sangwoo) Kim
>Priority: Critical
>
> Anyone can reproduce this issue by the code below
> (runs well in local mode, got exception with clusters)
> (it runs well in Spark 1.1.1)
> case class ClassA(value: String)
> val rdd = sc.parallelize(List(("k1", ClassA("v1")), ("k1", ClassA("v2")) ))
> rdd.groupByKey.collect
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 162 
> in stage 1.0 failed 4 times, most recent failure: Lost task 162.3 in stage 
> 1.0 (TID 1027, ip-172-16-182-27.ap-northeast-1.compute.internal): 
> java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$UserRelationshipRow
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:274)
> at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:91)
> at 
> org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
> at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

[jira] [Commented] (SPARK-6301) Unable to load external jars while submitting Spark Job

2015-03-12 Thread raju patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358815#comment-14358815
 ] 

raju patel commented on SPARK-6301:
---

I am calling Java functions from Python, which basically means loading Java 
classes from the jar. To achieve this I am using jnius, which acts as a bridge 
between Python and Java.
When I submit the Spark job with
spark-submit --master local --jars /pathto/jar pyhtonfile.py
it gives me the above-mentioned "Class not found 'classname'" error for classes 
that are present inside that jar.
Yes, I have carefully verified that all the classes are present inside the jar.

Please let me know if you need any other details.

> Unable to load external jars while submitting Spark Job
> ---
>
> Key: SPARK-6301
> URL: https://issues.apache.org/jira/browse/SPARK-6301
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 1.2.0
>Reporter: raju patel
>
> We are using Jnius to call Java functions from Python. But when we are trying 
> to submit the job using Spark,it is not able to load the java classes that 
> are provided in the --jars option, although it is successfully able to load 
> python class.
> The Error is like this :
>  c = find_javaclass(clsname)
>  File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
> (jnius/jnius.c:12815)
> JavaException: Class not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6275) Miss toDF() function in docs/sql-programming-guide.md

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6275:
-
Priority: Trivial  (was: Minor)
Assignee: zzc

This is also too minor to bother with a JIRA.

> Miss toDF() function in docs/sql-programming-guide.md 
> --
>
> Key: SPARK-6275
> URL: https://issues.apache.org/jira/browse/SPARK-6275
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: zzc
>Assignee: zzc
>Priority: Trivial
> Fix For: 1.4.0
>
>
> Miss toDF() function in docs/sql-programming-guide.md 
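
For context, a minimal sketch of the 1.3-style conversion the guide needs to show (the Person case class and the values here are illustrative, not the exact text added to the guide): an RDD of case classes no longer converts implicitly, so the implicits import and an explicit toDF() call are required.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ToDfExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("toDF-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._   // needed in 1.3 for the RDD-to-DataFrame conversion

    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31))).toDF()
    people.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 30").show()
    sc.stop()
  }
}
{code}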



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358799#comment-14358799
 ] 

Sean Owen commented on SPARK-6286:
--

[~dragos] I think it would be reasonable to handle this like {{TASK_LOST}}. I 
agree that there is not a reason to expect Mesos will be downgraded, and the 
required version is already required by Spark. This is also a little important 
to make sure this message is handled as intended and does not cause an 
exception. You want to make the simple PR?
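
A minimal, self-contained sketch of the shape such a PR could take (these are not Spark's actual types; the point is only that TASK_ERROR folds into the same branch as TASK_LOST so the match becomes exhaustive):

{code}
object TaskStateSketch {
  // Hypothetical stand-ins for the Mesos task states discussed in this ticket.
  sealed trait MesosState
  case object TASK_RUNNING  extends MesosState
  case object TASK_FINISHED extends MesosState
  case object TASK_LOST     extends MesosState
  case object TASK_ERROR    extends MesosState

  sealed trait Internal
  case object Running  extends Internal
  case object Finished extends Internal
  case object Lost     extends Internal

  def fromMesos(s: MesosState): Internal = s match {
    case TASK_RUNNING           => Running
    case TASK_FINISHED          => Finished
    case TASK_LOST | TASK_ERROR => Lost   // TASK_ERROR handled like TASK_LOST
  }
}
{code}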

> Handle TASK_ERROR in TaskState
> --
>
> Key: SPARK-6286
> URL: https://issues.apache.org/jira/browse/SPARK-6286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Iulian Dragos
>Priority: Minor
>  Labels: mesos
>
> Scala warning:
> {code}
> match may not be exhaustive. It would fail on the following input: TASK_ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6301) Unable to load external jars while submitting Spark Job

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6301:
-
Priority: Major  (was: Blocker)

Until it's clear what is being reported, this should not be marked "Blocker". 
Can you elaborate what class is not found, what you are running? Can you load 
the Java class without this third party library? Have you double-checked that 
the class is in your jar? It is not clear this is a Spark problem. 

> Unable to load external jars while submitting Spark Job
> ---
>
> Key: SPARK-6301
> URL: https://issues.apache.org/jira/browse/SPARK-6301
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 1.2.0
>Reporter: raju patel
>
> We are using Jnius to call Java functions from Python. But when we are trying 
> to submit the job using Spark,it is not able to load the java classes that 
> are provided in the --jars option, although it is successfully able to load 
> python class.
> The Error is like this :
>  c = find_javaclass(clsname)
>  File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
> (jnius/jnius.c:12815)
> JavaException: Class not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6275) Miss toDF() function in docs/sql-programming-guide.md

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6275.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

> Miss toDF() function in docs/sql-programming-guide.md 
> --
>
> Key: SPARK-6275
> URL: https://issues.apache.org/jira/browse/SPARK-6275
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: zzc
>Assignee: zzc
>Priority: Trivial
> Fix For: 1.4.0
>
>
> Miss toDF() function in docs/sql-programming-guide.md 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6306) Readme points to dead link

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358784#comment-14358784
 ] 

Sean Owen commented on SPARK-6306:
--

For a trivial change, a JIRA is just overhead. You don't need one unless there 
is a meaningful difference between the problem description and the fix itself.

> Readme points to dead link
> --
>
> Key: SPARK-6306
> URL: https://issues.apache.org/jira/browse/SPARK-6306
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Theodore Vasiloudis
>Priority: Trivial
> Fix For: 1.4.0
>
>
> The link to "Specifying the Hadoop Version" now points to 
> http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version.
> The correct link is: 
> http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6306) Readme points to dead link

2015-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6306.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4999
[https://github.com/apache/spark/pull/4999]

> Readme points to dead link
> --
>
> Key: SPARK-6306
> URL: https://issues.apache.org/jira/browse/SPARK-6306
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Theodore Vasiloudis
>Priority: Trivial
> Fix For: 1.4.0
>
>
> The link to "Specifying the Hadoop Version" now points to 
> http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version.
> The correct link is: 
> http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6306) Readme points to dead link

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358687#comment-14358687
 ] 

Apache Spark commented on SPARK-6306:
-

User 'thvasilo' has created a pull request for this issue:
https://github.com/apache/spark/pull/4999

> Readme points to dead link
> --
>
> Key: SPARK-6306
> URL: https://issues.apache.org/jira/browse/SPARK-6306
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Theodore Vasiloudis
>Priority: Trivial
>
> The link to "Specifying the Hadoop Version" now points to 
> http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version.
> The correct link is: 
> http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6306) Readme points to dead link

2015-03-12 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created SPARK-6306:
--

 Summary: Readme points to dead link
 Key: SPARK-6306
 URL: https://issues.apache.org/jira/browse/SPARK-6306
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Theodore Vasiloudis
Priority: Trivial


The link to "Specifying the Hadoop Version" now points to 
http://spark.apache.org/docs/latest/building-with-maven.html#specifying-the-hadoop-version.

The correct link is: 
http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358625#comment-14358625
 ] 

Apache Spark commented on SPARK-6305:
-

User 'liorchaga' has created a pull request for this issue:
https://github.com/apache/spark/pull/4998

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Tal Sliwowicz
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6305) Add support for log4j 2.x to Spark

2015-03-12 Thread Tal Sliwowicz (JIRA)
Tal Sliwowicz created SPARK-6305:


 Summary: Add support for log4j 2.x to Spark
 Key: SPARK-6305
 URL: https://issues.apache.org/jira/browse/SPARK-6305
 Project: Spark
  Issue Type: New Feature
  Components: Build
Reporter: Tal Sliwowicz


log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-12 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358583#comment-14358583
 ] 

Manoj Kumar commented on SPARK-5692:


I'm not sure about Eclipse, but I just work in Sublime Text and build using the 
instructions given here: 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools


> Model import/export for Word2Vec
> 
>
> Key: SPARK-5692
> URL: https://issues.apache.org/jira/browse/SPARK-5692
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: ANUPAM MEDIRATTA
>
> Support save and load for Word2VecModel. We may want to discuss whether we 
> want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-12 Thread ANUPAM MEDIRATTA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358573#comment-14358573
 ] 

ANUPAM MEDIRATTA commented on SPARK-5692:
-

I tried working on it. I am new to Spark and Scala.

I am not able to run tests in Scala IDE, and I am not able to compile the code base 
in Eclipse so that I can run the tests to verify my code.

Are there any instructions on how to compile this codebase in Eclipse (Scala IDE)?

> Model import/export for Word2Vec
> 
>
> Key: SPARK-5692
> URL: https://issues.apache.org/jira/browse/SPARK-5692
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: ANUPAM MEDIRATTA
>
> Support save and load for Word2VecModel. We may want to discuss whether we 
> want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6227) PCA and SVD for PySpark

2015-03-12 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358529#comment-14358529
 ] 

Meethu Mathew commented on SPARK-6227:
--

[~mengxr]  Please give your inputs on the same.

> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html
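
For reference, a hedged sketch of the Scala-only API this ticket asks to mirror in PySpark (method names follow the linked MLlib docs; the data is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimensionalityReductionScalaOnly {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pca-svd").setMaster("local[*]"))
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val mat = new RowMatrix(rows)

    val pc  = mat.computePrincipalComponents(2)   // PCA: 3x2 matrix of principal components
    val svd = mat.computeSVD(2, computeU = true)  // SVD: U, s, V for the top 2 singular values
    println(pc)
    println(svd.s)
    sc.stop()
  }
}
{code}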



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6256) Python MLlib API missing items: Regression

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358521#comment-14358521
 ] 

Apache Spark commented on SPARK-6256:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4997

> Python MLlib API missing items: Regression
> --
>
> Key: SPARK-6256
> URL: https://issues.apache.org/jira/browse/SPARK-6256
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> This JIRA lists items missing in the Python API for this sub-package of MLlib.
> This list may be incomplete, so please check again when sending a PR to add 
> these features to the Python API.
> Also, please check for major disparities between documentation; some parts of 
> the Python API are less well-documented than their Scala counterparts.  Some 
> items may be listed in the umbrella JIRA linked to this task.
> LassoWithSGD
> * setIntercept
> * setValidateData
> LinearRegressionWithSGD, RidgeRegressionWithSGD
> * setValidateData



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6301) Unable to load external jars while submitting Spark Job

2015-03-12 Thread raju patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

raju patel updated SPARK-6301:
--
Description: 
We are using Jnius to call Java functions from Python. But when we are trying 
to submit the job using Spark,it is not able to load the java classes that are 
provided in the --jars option, although it is successfully able to load python 
class.
The Error is like this :
 c = find_javaclass(clsname)
 File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
(jnius/jnius.c:12815)
JavaException: Class not found

  was:
We are using Jnius to call Java functions from Python. But when we are trying 
to submit the job using Spark,it is not able to load the java classes that are 
provided in the --jars option although it is successfully able to load python 
class.
The Error is like this :
 c = find_javaclass(clsname)
 File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
(jnius/jnius.c:12815)
JavaException: Class not found


> Unable to load external jars while submitting Spark Job
> ---
>
> Key: SPARK-6301
> URL: https://issues.apache.org/jira/browse/SPARK-6301
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 1.2.0
>Reporter: raju patel
>Priority: Blocker
>
> We are using Jnius to call Java functions from Python. But when we are trying 
> to submit the job using Spark,it is not able to load the java classes that 
> are provided in the --jars option, although it is successfully able to load 
> python class.
> The Error is like this :
>  c = find_javaclass(clsname)
>  File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
> (jnius/jnius.c:12815)
> JavaException: Class not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6304) Checkpointing doesn't retain driver port

2015-03-12 Thread Marius Soutier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Soutier updated SPARK-6304:
--
Description: 
In a check-pointed Streaming application running on a fixed driver port, the 
setting "spark.driver.port" is not loaded when recovering from a checkpoint.

(The driver is then started on a random port.)


  was:
In a check-pointed Streaming application running on a fixed driver port, the 
setting "spark.driver.port" is not loaded when recovering from checkpoint.

(The driver is then started on a random port.)



> Checkpointing doesn't retain driver port
> 
>
> Key: SPARK-6304
> URL: https://issues.apache.org/jira/browse/SPARK-6304
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.1
>Reporter: Marius Soutier
>
> In a check-pointed Streaming application running on a fixed driver port, the 
> setting "spark.driver.port" is not loaded when recovering from a checkpoint.
> (The driver is then started on a random port.)
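
A minimal sketch of the setup this report describes (checkpoint path, port, and batch interval are illustrative): the port is pinned inside the creating function, and per the report it is honoured on the first run but not when the context is recovered from the checkpoint.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FixedPortStreamingApp {
  val checkpointDir = "/tmp/checkpoint"          // illustrative path

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("fixed-driver-port")
      .set("spark.driver.port", "7777")          // illustrative fixed driver port
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    ssc.socketTextStream("localhost", 9999).count().print()  // illustrative stream + output op
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On recovery, createContext is not called; the conf comes from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}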



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6304) Checkpointing doesn't retain driver port

2015-03-12 Thread Marius Soutier (JIRA)
Marius Soutier created SPARK-6304:
-

 Summary: Checkpointing doesn't retain driver port
 Key: SPARK-6304
 URL: https://issues.apache.org/jira/browse/SPARK-6304
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.1
Reporter: Marius Soutier


In a check-pointed Streaming application running on a fixed driver port, the 
setting "spark.driver.port" is not loaded when recovering from checkpoint.

(The driver is then started on a random port.)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6282) Strange Python import error when using random() in a lambda function

2015-03-12 Thread Pavel Laskov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358504#comment-14358504
 ] 

Pavel Laskov commented on SPARK-6282:
-

Hi Sven and Joseph,

Thanks for a quick reply to my bug report. I still think the problem is 
somewhere in Spark. Here is an autonomous code snippet which triggers the error 
on my system. Uncommenting any of the imports marked with ### causes a crash. 
Switching to "import random / random.random()" fixes the problems. None of the 
functions imported in the ### lines is used in the test code. Looks like a very 
obscure dependency of some mllib components on _winreg? 


from random import random
# import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs
### Any of these imports causes the crash
### from pyspark.mllib.tree import RandomForest, DecisionTreeModel
### from pyspark.mllib.linalg import SparseVector
### from pyspark.mllib.regression import LabeledPoint

if __name__ == "__main__":
 
sc = SparkContext(appName="Random() bug test")
data = RandomRDDs.normalVectorRDD(sc,numRows=1,numCols=200)
d = data.map(lambda x: (random(), x))
print d.first()


Here is the full trace of the error:

Traceback (most recent call last):
  File "/home/laskov/research/pe-class/python/src/experiments/test_random.py", 
line 16, in 
print d.first()
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 1139, in 
first
rs = self.take(1)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
totalParts = self._jrdd.partitions().size()
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 2115, in 
_jrdd
pickled_command = ser.dumps(command)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/serializers.py", line 406, 
in dumps
return cloudpickle.dumps(obj, 2)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 816, 
in dumps
cp.dump(obj)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 133, 
in dump
return pickle.Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 254, 
in save_function
self.save_function_tuple(obj, [themodule])
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 304, 
in save_function_tuple
save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 254, 
in save_function
self.save_function_tuple(obj, [themodule])
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 304, 
in save_function_tuple
save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 249, 
in save_function
self.save_function_tuple(obj, modList)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 309, 
in save_function_tuple
save(f_globals)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 174, 
in save_dict
pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
  File "/usr/lib/pyth

[jira] [Commented] (SPARK-6190) create LargeByteBuffer abstraction for eliminating 2GB limit on blocks

2015-03-12 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357609#comment-14357609
 ] 

Imran Rashid commented on SPARK-6190:
-

Hi [~rxin],

I've been adding scattered notes across the various tickets, which I think has 
led to a lot of the confusion -- let me try to summarize things here.

I completely agree about the importance of the various cases.  Caching large 
blocks is *by far* the most important case.  However, I think it's worth 
exploring the other cases now for two reasons.  (a) I still think they need to 
be solved eventually for a consistent user experience.  Eg., if caching locally 
works, but reading from a remote cache doesn't, a user will be baffled when on 
run 1 of their job, everything works fine, but run 2, with the same data & same 
code, tasks get scheduled slightly different and require a remote fetch, and 
KABOOM!  That's the kind of experience that makes the average user want to throw 
spark out the window. (This is actually what I thought you were pointing out in 
your comments on the earlier jira -- that we can forget about uploading at this 
point, but need to make sure remote fetches work.) (b) We should make sure that 
whatever approach we take at least leaves the door open for solutions to all 
the problems.  At least for myself, I wasn't sure if this approach would work 
for everything initially, but exploring the options makes me feel like its all 
possible.  (which gets to your question about large blocks vs. multi-blocks.)

The proposal isn't exactly "read-only", it also supports writing via 
{{LargeByteBufferOutputStream}}.  It turns out thats all we need.  The 
BlockManager currently exposes {{ByteBuffers}}, but it actually doesn't need 
to.  For example, currently local shuffle fetches only expose a FileInputStream 
over the data -- thats why there isn't a 2GB limit on local shuffles. (it gets 
wrapped in a {{FileSegmentManagedBuffer}} and eventually read here: 
https://github.com/apache/spark/blob/55c4831d68c8326380086b5540244f984ea9ec27/core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala#L300)
  It also makes sense that we only need stream access, since RDDs & broadcast 
vars are immutable -- eg. we never say "treat bytes 37-40 as an int, and 
increment its value".

Fundamentally, blocks are always created via serialization -- more 
specifically, {{Serializer#serializeStream}}.  Obviously there isn't any limit 
when writing to a {{FileOutputStream}}, we just need a way to write to an 
in-memory output stream over 2GB.  We can create an {{Array\[Array\[Byte\]\]}} 
already with {{ByteArrayChunkOutputStream}} 
https://github.com/apache/spark/blob/55c4831d68c8326380086b5540244f984ea9ec27/core/src/main/scala/org/apache/spark/util/io/ByteArrayChunkOutputStream.scala
 (currently used to create multiple blocks by TorrentBroadcast).  We can use 
that to write out more than 2GB, eg. creating many chunks of max size 64K.
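
To make the stream-only idea concrete, here is a minimal, self-contained sketch (not Spark's ByteArrayChunkOutputStream, just the same shape): writes go into fixed-size chunks so no single array ever needs to hold the whole block, and reads come back as one InputStream concatenated over the chunks. Layering a serialize/deserialize stream on top of this never requires a contiguous ByteBuffer, which is the property the argument above relies on.

{code}
import java.io.{ByteArrayInputStream, InputStream, OutputStream, SequenceInputStream}
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

class ChunkedByteArrayOutputStream(chunkSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]](new Array[Byte](chunkSize))
  private var posInChunk = 0

  override def write(b: Int): Unit = {
    if (posInChunk == chunkSize) {          // current chunk full: start a new one
      chunks += new Array[Byte](chunkSize)
      posInChunk = 0
    }
    chunks.last(posInChunk) = b.toByte
    posInChunk += 1
  }

  /** Stream access over all chunks, in order; no single contiguous buffer is ever built. */
  def toInputStream: InputStream = {
    val streams = chunks.zipWithIndex.map { case (chunk, i) =>
      val len = if (i == chunks.length - 1) posInChunk else chunkSize
      new ByteArrayInputStream(chunk, 0, len): InputStream
    }
    new SequenceInputStream(streams.iterator.asJavaEnumeration)
  }
}
{code}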

Similarly, we need a way to convert the various representations of large blocks 
back into {{InputStreams}}.  File-based input streams have no problem 
({{DiskStore}} only fails b/c the code currently tries to convert to a 
{{ByteBuffer}}, though conceptually this is unnecessary).  For in-memory large 
blocks, represented as {{Array\[Array\[Byte\]\]}}, again we can do the same as 
{{TorrentBroadcast}}.  The final case is network transfer.  This involves 
changing the netty frame decoder to handle frames that are > 2GB -- then we 
just use the same input stream for the in-memory case.  That was the last piece 
that I was prototyping, and was mentioning in my latest comments.  I have an 
implementation available here: 
https://github.com/squito/spark/blob/5e83a55daa30a19840214f77681248e112635bf6/network/common/src/main/java/org/apache/spark/network/protocol/FixedChunkLargeFrameDecoder.java

Its a good question about whether we should allow large blocks, or instead we 
should have blocks be limited at 2GB and have another layer put multiple blocks 
together.  I don't know if I have very clear objective arguments for one vs. 
the other, but I did consider both and felt like this version was much simpler 
to implement.  Especially given the limited api that is actually needed (only 
stream access), the changes proposed here really aren't that big.  It keeps the 
changes more nicely contained to the layers underneath BlockManager (with 
mostly cosmetic / naming changes required in outer layers since we'd no longer 
be returning ByteBuffers).  Going down this road certainly doesn't prevent us 
from later deciding to have blocks be fragmented (then its just a question of 
naming: are "blocks" the smallest units that we work with in the internals, and 
there is some new logical unit which wraps blocks?  or are "blocks" the logical 
unit that is exposed, and there is some new smaller unit which is used by the 
internals? 

[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

2015-03-12 Thread mgdadv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357610#comment-14357610
 ] 

mgdadv commented on SPARK-6189:
---

While the dot is legal in R and SQL, I don't think there is a nice way of 
making it
legal in python. So at least in the Spark python code, I think something should
be done about it.

I just realized that the automatic renaming can cause problems if that entry
already exists.  For example, what if GNP_deflator was already in the data set
and then GNP.deflator gets changed.

I think the best thing to do is to just warn the user by printing out a warning
message. I have changed the patch accordingly.

Here is some example code for pyspark:

import StringIO  # needed for the inline CSV below
import pandas as pd

df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
spdf = sqlCtx.createDataFrame(df)  # sqlCtx as provided by the pyspark shell
spdf.take(2)
spdf[spdf.a==102].take(2)

So far this works, but this fails:
spdf[spdf.a.b==101].take(2)

In pandas df.a.b doesn't work either, but the fields can be accessed via the 
string "a.b", i.e.:
df["a.b"]


> Pandas to DataFrame conversion should check field names for periods
> ---
>
> Key: SPARK-6189
> URL: https://issues.apache.org/jira/browse/SPARK-6189
> Project: Spark
>  Issue Type: Improvement
>  Components: DataFrame, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
> DataFrame and then use toDF() to convert that into a Spark DataFrame.  The R 
> dataset had a column with a period in it (column "GNP.deflator" in the 
> "longley" dataset).  When I tried to select it using the Spark DataFrame DSL, 
> I could not because the DSL thought the period was selecting a field within 
> GNP.
> Also, since "GNP" is another field's name, it gives an error which could be 
> obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
> type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and 
> warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6303) Average should be in canBeCodeGened list

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358399#comment-14358399
 ] 

Apache Spark commented on SPARK-6303:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4996

> Average should be in canBeCodeGened list
> 
>
> Key: SPARK-6303
> URL: https://issues.apache.org/jira/browse/SPARK-6303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, 
> CollectHashSet. Average should be in the list too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6303) Average should be in canBeCodeGened list

2015-03-12 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6303:
--

 Summary: Average should be in canBeCodeGened list
 Key: SPARK-6303
 URL: https://issues.apache.org/jira/browse/SPARK-6303
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently canBeCodeGened only checks Sum, Count, Max, CombineSetsAndCount, 
CollectHashSet. Average should be in the list too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6227) PCA and SVD for PySpark

2015-03-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6227:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6100

> PCA and SVD for PySpark
> ---
>
> Key: SPARK-6227
> URL: https://issues.apache.org/jira/browse/SPARK-6227
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.2.1
>Reporter: Julien Amelot
>Priority: Minor
>
> The Dimensionality Reduction techniques are not available via Python (Scala + 
> Java only).
> * Principal component analysis (PCA)
> * Singular value decomposition (SVD)
> Doc:
> http://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6286) Handle TASK_ERROR in TaskState

2015-03-12 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357310#comment-14357310
 ] 

Iulian Dragos commented on SPARK-6286:
--

Good point. It's been [introduced in 
0.21.0|http://mesos.apache.org/blog/mesos-0-21-0-released/]. According to 
[pom.xml|https://github.com/apache/spark/blob/master/pom.xml#L119], Spark 
depends on `0.21.0`, so it seems safe to handle it. Feel free to close if you 
think it's going to break something else.

> Handle TASK_ERROR in TaskState
> --
>
> Key: SPARK-6286
> URL: https://issues.apache.org/jira/browse/SPARK-6286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Iulian Dragos
>Priority: Minor
>  Labels: mesos
>
> Scala warning:
> {code}
> match may not be exhaustive. It would fail on the following input: TASK_ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6198) Support "select current_database()"

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358361#comment-14358361
 ] 

Apache Spark commented on SPARK-6198:
-

User 'DoingDone9' has created a pull request for this issue:
https://github.com/apache/spark/pull/4995

> Support "select current_database()"
> ---
>
> Key: SPARK-6198
> URL: https://issues.apache.org/jira/browse/SPARK-6198
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.1
>Reporter: DoingDone9
>
> The method (evaluate) has changed in UDFCurrentDB; it now just throws an 
> exception, but hiveUdfs calls this method and fails.
> @Override
>   public Object evaluate(DeferredObject[] arguments) throws HiveException {
> throw new IllegalStateException("never");
>   }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6285) Duplicated code leads to errors

2015-03-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357255#comment-14357255
 ] 

Sean Owen commented on SPARK-6285:
--

I do not observe any compilation problem in Maven or IntelliJ though, so I 
don't know if it's an actual problem in the source.

That said, I don't see why there are two copies of the same class; one can be 
removed. But the containing class in the main source tree looks like test code. 
I think you can try moving it to the test tree too as part of a fix. 
ParquetTest is only used from test code, and ParquetTestData is... only used in 
sql's README.md? maybe my IDE is reading that wrong.

> Duplicated code leads to errors
> ---
>
> Key: SPARK-6285
> URL: https://issues.apache.org/jira/browse/SPARK-6285
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Iulian Dragos
>
> The following class is duplicated inside 
> [ParquetTestData|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala#L39]
>  and 
> [ParquetIOSuite|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala#L44],
>  with exact same code and fully qualified name:
> {code}
> org.apache.spark.sql.parquet.TestGroupWriteSupport
> {code}
> The second one was introduced in 
> [3b395e10|https://github.com/apache/spark/commit/3b395e10510782474789c9098084503f98ca4830],
>  but even though it mentions that `ParquetTestData` should be removed later, 
> I couldn't find a corresponding Jira ticket.
> This duplicate class causes the Eclipse builder to fail (since src/main and 
> src/test are compiled together in Eclipse, unlike Sbt).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework

2015-03-12 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357180#comment-14357180
 ] 

Timothy Chen commented on SPARK-6284:
-

https://github.com/apache/spark/pull/4960

> Support framework authentication and role in Mesos framework
> 
>
> Key: SPARK-6284
> URL: https://issues.apache.org/jira/browse/SPARK-6284
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
>
> Support framework authentication and role in both Coarse grain and fine grain 
> mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5987) Model import/export for GaussianMixtureModel

2015-03-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357109#comment-14357109
 ] 

Joseph K. Bradley commented on SPARK-5987:
--

This isn't a bug in Spark SQL.  The issue is that we haven't defined a 
UserDefinedType for Matrices.  (We should, but haven't yet.)  When I said 
"basic types," I meant the types enumerated on the SQL programming guide 
(basically, Array[Double] or Seq[Double] will be best).  I'd recommend 
flattening the matrix into an Array[Double] instead of having nested types.  
The nesting is less efficient because of all of the extra objects it creates.
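
To illustrate the flattening suggestion, a hedged sketch (the case-class field names are illustrative, not the storage format that was eventually committed): store a matrix as its dimensions plus a flat Array[Double], which maps directly onto the SQL-friendly basic types.

{code}
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}

// Illustrative flat representation: dimensions plus a column-major value array.
case class FlatMatrix(numRows: Int, numCols: Int, values: Array[Double])

object MatrixFlatten {
  def flatten(m: Matrix): FlatMatrix =
    FlatMatrix(m.numRows, m.numCols, m.toArray)       // toArray gives a column-major copy

  def restore(f: FlatMatrix): Matrix =
    new DenseMatrix(f.numRows, f.numCols, f.values)   // rebuilds an equivalent dense matrix
}
{code}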

> Model import/export for GaussianMixtureModel
> 
>
> Key: SPARK-5987
> URL: https://issues.apache.org/jira/browse/SPARK-5987
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
>
> Support save/load for GaussianMixtureModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6302) GeneratedAggregate uses wrong schema on updateProjection

2015-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358333#comment-14358333
 ] 

Apache Spark commented on SPARK-6302:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4994

> GeneratedAggregate uses wrong schema on updateProjection
> 
>
> Key: SPARK-6302
> URL: https://issues.apache.org/jira/browse/SPARK-6302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> The updateProjection in GeneratedAggregate currently uses updateSchema as its 
> input schema, but the input schema should be child.output instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6301) Unable to load external jars while submitting Spark Job

2015-03-12 Thread raju patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

raju patel updated SPARK-6301:
--
Description: 
We are using Jnius to call Java functions from Python. But when we try to 
submit the job using Spark, it is not able to load the Java classes provided 
in the --jars option, although it loads the Python class successfully.
The error is:
 c = find_javaclass(clsname)
 File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
(jnius/jnius.c:12815)
JavaException: Class not found

  was:
We are using Jnius to call Java functions from Python. But when we try to 
submit the job using Spark, it is not able to load the Java classes provided 
in the --jars option, although it loads the Python class successfully.
The error is:
 c = find_javaclass(clsname) File "jnius_export_func.pxi", line 23, in 
jnius.find_javaclass (jnius/jnius.c:12815)


> Unable to load external jars while submitting Spark Job
> ---
>
> Key: SPARK-6301
> URL: https://issues.apache.org/jira/browse/SPARK-6301
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 1.2.0
>Reporter: raju patel
>Priority: Blocker
>
> We are using Jnius to call Java functions from Python. But when we try to 
> submit the job using Spark, it is not able to load the Java classes provided 
> in the --jars option, although it loads the Python class successfully.
> The error is:
>  c = find_javaclass(clsname)
>  File "jnius_export_func.pxi", line 23, in jnius.find_javaclass 
> (jnius/jnius.c:12815)
> JavaException: Class not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6302) GeneratedAggregate uses wrong schema on updateProjection

2015-03-12 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6302:
--

 Summary: GeneratedAggregate uses wrong schema on updateProjection
 Key: SPARK-6302
 URL: https://issues.apache.org/jira/browse/SPARK-6302
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor


The updateProjection in GeneratedAggregate currently uses updateSchema as its 
input schema, but the input schema should be child.output instead.
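
To make the failure mode concrete, the sketch below is a deliberately toy, 
self-contained Scala illustration (Schema and compileProjection are made-up 
names, not Spark's GeneratedAggregate code): column references are resolved to 
ordinals when a projection is built, so binding against updateSchema while 
feeding it rows laid out as child.output makes the lookups hit the wrong 
positions.

{code}
import scala.util.Try

// Toy stand-in for an operator's output schema.
case class Schema(fieldNames: Seq[String])

// "Compile" a projection of the named fields against an input schema:
// each field is resolved to an ordinal once, up front.
def compileProjection(fields: Seq[String], input: Schema): Seq[Any] => Seq[Any] = {
  val ordinals = fields.map(f => input.fieldNames.indexOf(f))
  row => ordinals.map(i => row(i))
}

val childOutput  = Schema(Seq("key", "value"))               // what the child operator produces
val updateSchema = Schema(Seq("currentSum", "key", "value")) // buffer attributes ++ child output

val childRow = Seq[Any]("a", 10)  // a row laid out in child.output order

// Bound to child.output: "value" resolves to ordinal 1 and yields List(10).
val good = compileProjection(Seq("value"), childOutput)(childRow)

// Bound to updateSchema: "value" resolves to ordinal 2, which a child row
// does not have, so the lookup fails (or, with in-range ordinals, silently
// reads the wrong column).
val bad = Try(compileProjection(Seq("value"), updateSchema)(childRow))
{code}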



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-03-12 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357069#comment-14357069
 ] 

Manoj Kumar commented on SPARK-5692:


okay, great

> Model import/export for Word2Vec
> 
>
> Key: SPARK-5692
> URL: https://issues.apache.org/jira/browse/SPARK-5692
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: ANUPAM MEDIRATTA
>
> Support save and load for Word2VecModel. We may want to discuss whether we 
> want to be compatible with the original Word2Vec model storage format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


