[jira] [Commented] (SPARK-2873) OOM happens when group by and join operation with big data
[ https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101915#comment-14101915 ] Apache Spark commented on SPARK-2873: - User 'guowei2' has created a pull request for this issue: https://github.com/apache/spark/pull/2029 OOM happens when group by and join operation with big data --- Key: SPARK-2873 URL: https://issues.apache.org/jira/browse/SPARK-2873 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: guowei -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3119) Re-implement TorrentBroadcast
Reynold Xin created SPARK-3119: -- Summary: Re-implement TorrentBroadcast Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Updated] (SPARK-3119) Re-implement TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3119: --- Description: TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array. Re-implement TorrentBroadcast - Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array.
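[Editor's note] Point 5 above is the kind of copy that can be avoided by deserializing straight out of the fetched blocks. A hedged Python sketch of that idea (illustrative only, not Spark's actual code): wrap the block list in a stream that reads across block boundaries, so no merged intermediate array is ever built.

```python
import io

class ChainedRaw(io.RawIOBase):
    """Raw stream that reads across a list of byte blocks in order."""
    def __init__(self, blocks):
        self._iter = iter(blocks)
        self._cur = io.BytesIO(next(self._iter, b""))

    def readable(self):
        return True

    def readinto(self, b):
        while True:
            n = self._cur.readinto(b)
            if n:
                return n
            nxt = next(self._iter, None)
            if nxt is None:
                return 0  # EOF: all blocks consumed
            self._cur = io.BytesIO(nxt)

# A deserializer can then consume the blocks through one stream,
# without first copying them into a single giant array:
stream = io.BufferedReader(ChainedRaw([b"hello ", b"wor", b"ld"]))
data = stream.read()  # b"hello world"
```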
[jira] [Updated] (SPARK-3119) Re-implement TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3119: --- Component/s: Spark Core Re-implement TorrentBroadcast - Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array.
[jira] [Commented] (SPARK-3119) Re-implement TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101931#comment-14101931 ] Apache Spark commented on SPARK-3119: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2030 Re-implement TorrentBroadcast - Key: SPARK-3119 URL: https://issues.apache.org/jira/browse/SPARK-3119 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast is unnecessarily complicated: 1. It tracks a lot of mutable states, such as total number of bytes, number of blocks fetched. 2. It has at least two data structures that are not needed: TorrentInfo and TorrentBlock. 3. It uses getSingle on executors to get the block instead of getLocal, resulting in an extra roundtrip to look up the location of the block when the block doesn't exist yet. 4. It has a metadata block that is completely unnecessary. 5. It does an extra memory copy during deserialization to copy all the blocks into a single giant array.
[jira] [Created] (SPARK-3120) Local Dirs is not useful in yarn-client mode
hzw created SPARK-3120: -- Summary: Local Dirs is not useful in yarn-client mode Key: SPARK-3120 URL: https://issues.apache.org/jira/browse/SPARK-3120 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Environment: Spark 1.0.2 Yarn 2.3.0 Reporter: hzw I was using Spark 1.0.2 and Hadoop 2.3.0 to run a Spark application on YARN. I expected to set spark.local.dir to spread the shuffle files across many disks, so I exported LOCAL_DIRS in spark-env.sh. But it failed to create the local dirs in my specified path; they just went to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/, the Hadoop default path. To reproduce this: 1. Do not set "yarn.nodemanager.local-dirs" in yarn-site.xml, which influences the result. 2. Run a job and then look in the executor log for the line INFO DiskBlockManager: Created local directory at .. In addition, I tried exporting LOCAL_DIRS in yarn-env.sh. That passes the LOCAL_DIRS value to the ExecutorLauncher, but it is still overwritten by YARN when launching the executor container.
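[Editor's note] On YARN, executors write shuffle files to the local directories that YARN hands them, which is why exporting LOCAL_DIRS on the Spark side has no effect here; the property the report mentions is the one to set. A minimal yarn-site.xml fragment (the /disk1 and /disk2 paths are illustrative placeholders):

```xml
<!-- yarn.nodemanager.local-dirs controls where NodeManagers (and
     hence Spark executors running on YARN) create local/shuffle
     directories. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/disk1/nm-local,/disk2/nm-local</value>
</property>
```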
[jira] [Comment Edited] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102012#comment-14102012 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 8:22 AM: - I found the error id is continuous. Seems there is something wrong with the {{zipWithIndex}}. was (Author: gq): I found the error id is continuous. Seems there is something wrong with the{{zipWithIndex}}. In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Commented] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102012#comment-14102012 ] Guoqiang Li commented on SPARK-3098: I found the error id is continuous. Seems there is something wrong with the {{zipWithIndex}}. In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Commented] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102015#comment-14102015 ] Sean Owen commented on SPARK-3098: -- zipWithIndex returns an RDD[(T,Long)]. It does not create a continuous value, and doesn't add a value in the first position in the tuples such that it may be used as a key. Your IDs do not seem to be floating-point values here. What does your comment mean? In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
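[Editor's note] A plain-Python analogue of the contract Sean describes (element first, 0-based index second; a sketch of the semantics, not a Spark API):

```python
# Analogue of RDD.zipWithIndex: each element is paired with its
# 0-based position, element first and index second -- the index is
# appended to the tuple, not placed in the key slot.
def zip_with_index(elems):
    return [(e, i) for i, e in enumerate(elems)]

pairs = zip_with_index(["a", "b", "c"])
# pairs == [("a", 0), ("b", 1), ("c", 2)]
```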
[jira] [Commented] (SPARK-3098) In some cases, operation groupByKey get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102024#comment-14102024 ] Guoqiang Li commented on SPARK-3098: the (id, value) pairs are generated there zipWithIndex. Reproduce the code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} In some cases, operation groupByKey get a wrong results Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Updated] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3098: --- Summary: In some cases, operation zipWithIndex get a wrong results (was: In some cases, operation groupByKey get a wrong results) In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Commented] (SPARK-3037) Add ArrayType containing null value support to Parquet.
[ https://issues.apache.org/jira/browse/SPARK-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102034#comment-14102034 ] Apache Spark commented on SPARK-3037: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2032 Add ArrayType containing null value support to Parquet. --- Key: SPARK-3037 URL: https://issues.apache.org/jira/browse/SPARK-3037 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Priority: Blocker Parquet support should handle {{ArrayType}} when {{containsNull}} is {{true}}. When {{containsNull}} is {{true}}, the schema should be as follows: {noformat} message root { optional group a (LIST) { repeated group bag { optional int32 array_element; } } } {noformat} FYI: Hive's Parquet writer *always* uses this schema, and its reader can read only from this schema, i.e. the current Parquet support of Spark SQL is not compatible with Hive. NOTICE: If Hive compatibility is top priority, we also have to use this schema regardless of {{containsNull}}, which will break backward compatibility. But using this schema could affect performance.
[jira] [Commented] (SPARK-3036) Add MapType containing null value support to Parquet.
[ https://issues.apache.org/jira/browse/SPARK-3036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102033#comment-14102033 ] Apache Spark commented on SPARK-3036: - User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/2032 Add MapType containing null value support to Parquet. - Key: SPARK-3036 URL: https://issues.apache.org/jira/browse/SPARK-3036 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Priority: Blocker The current Parquet schema for {{MapType}} is as follows regardless of {{valueContainsNull}}: {noformat} message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; required int32 value; } } } {noformat} and if the map contains a {{null}} value, it throws a runtime exception. To handle {{MapType}} containing a {{null}} value, the schema should be as follows if {{valueContainsNull}} is {{true}}: {noformat} message root { optional group a (MAP) { repeated group map (MAP_KEY_VALUE) { required int32 key; optional int32 value; } } } {noformat} FYI: Hive's Parquet writer *always* uses the latter schema, but its reader can read from both schemas. NOTICE: This change will break backward compatibility when the schema is read from Parquet metadata ({{org.apache.spark.sql.parquet.row.metadata}}).
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102024#comment-14102024 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 8:55 AM: - the (id, value) pairs are generated there zipWithIndex. The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} was (Author: gq): the (id, value) pairs are generated there zipWithIndex. Reproduce the code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102024#comment-14102024 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 8:58 AM: - the (id, value) pairs are generated by zipWithIndex. The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} was (Author: gq): the (id, value) pairs are generated there zipWithIndex. The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2.toString)) val d = c.filter(t => t._1 % 100 > 5) e.join(d).filter(t => t._2._1 != t._2._2.toString).take(3) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Updated] (SPARK-2964) Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh
[ https://issues.apache.org/jira/browse/SPARK-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2964: -- Summary: Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh (was: Improve spark-sql and start-thriftserver to leverage bin/util.sh) Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh - Key: SPARK-2964 URL: https://issues.apache.org/jira/browse/SPARK-2964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor Now, we have bin/utils.sh, so let's improve spark-sql and start-thriftserver.sh to leverage utils.sh.
[jira] [Updated] (SPARK-2964) Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh
[ https://issues.apache.org/jira/browse/SPARK-2964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-2964: -- Description: The spark-sql script expects a -s option, but that's wrong; it's a typo for -S (capital S). We need to fix that. And now that we have bin/utils.sh, let's improve spark-sql and start-thriftserver.sh to leverage utils.sh. was:Now, we have bin/utils.sh, so let's improve spark-sql and start-thriftserver.sh to leverage utils.sh. Fix wrong option (-S, --silent), and improve spark-sql and start-thriftserver to leverage bin/util.sh - Key: SPARK-2964 URL: https://issues.apache.org/jira/browse/SPARK-2964 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Minor The spark-sql script expects a -s option, but that's wrong; it's a typo for -S (capital S). We need to fix that. And now that we have bin/utils.sh, let's improve spark-sql and start-thriftserver.sh to leverage utils.sh.
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102044#comment-14102044 ] Sean Owen commented on SPARK-3098: -- It would be helpful if you would explain what you are trying to reproduce here; this is just code, and there's not a continuous value here, for example. It looks like you're producing overlapping sequences of numbers like 1..1, 6001..16000, ... and then flattening and removing duplicates, to get the range 1..47404000. That's zipped with its index to get these as (n,n-1) pairs. Then you map the second element to a String, and do the same to a subset of the data, join them, and see if there are any mismatches, because there shouldn't be. All the keys and values are unique. But why would this demonstrate something about zipWithIndex more directly than a test of the RDD c? More importantly, I ran this locally and got an empty Array, as expected. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102064#comment-14102064 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 9:41 AM: - Our cluster runs on YARN. You can try the following code in cluster mode: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} was (Author: gq): Our cluster runs on YARN. You can try the following code in cluster mode: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() val e = c.map(t => (t._1, t._2)) e.join(e).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
[jira] [Created] (SPARK-3121) Wrong implementation of implicit bytesWritableConverter
Jakub Dubovsky created SPARK-3121: - Summary: Wrong implementation of implicit bytesWritableConverter Key: SPARK-3121 URL: https://issues.apache.org/jira/browse/SPARK-3121 Project: Spark Issue Type: Bug Affects Versions: 1.0.2 Reporter: Jakub Dubovsky Priority: Minor val path = ... //path to seq file with BytesWritable as type of both key and value val file = sc.sequenceFile[Array[Byte],Array[Byte]](path) file.take(1)(0)._1 This prints incorrect content of the byte array. The actual content starts out correct, but some random bytes and zeros are appended. BytesWritable has two methods: getBytes() - returns the content of the whole internal array, which is often longer than the actual value stored. It usually contains the rest of previous, longer values. copyBytes() - returns just the beginning of the internal array, determined by the internal length property. It looks like the implicit conversion between BytesWritable and Array[Byte] uses getBytes instead of the correct copyBytes.
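[Editor's note] A Python sketch of the pitfall described above ({{BytesWritableLike}} is a hypothetical stand-in, not Hadoop's real Java class): the backing array only grows, so after a shorter value is stored, returning the raw array leaks stale bytes from the previous, longer value.

```python
class BytesWritableLike:
    """Toy model of Hadoop BytesWritable's buffer-reuse behavior."""
    def __init__(self):
        self._buf = bytearray()
        self._len = 0

    def set(self, data: bytes):
        if len(data) > len(self._buf):      # grow only when needed
            self._buf = bytearray(len(data))
        self._buf[:len(data)] = data
        self._len = len(data)               # old tail bytes remain in _buf

    def get_bytes(self) -> bytes:
        return bytes(self._buf)             # whole backing array (wrong here)

    def copy_bytes(self) -> bytes:
        return bytes(self._buf[:self._len]) # just the valid prefix (correct)

w = BytesWritableLike()
w.set(b"longer-value")
w.set(b"short")
# get_bytes()  -> b"shortr-value"  (correct start, stale tail appended)
# copy_bytes() -> b"short"
```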
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102101#comment-14102101 ] Guoqiang Li commented on SPARK-3098: Seems {{zipWithUniqueId}} also has this issue. {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 10000).toSeq.map(p => i * 6000 + p) }.distinct().zipWithUniqueId() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val preferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code}
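[Editor's note] For contrast with {{zipWithIndex}}: Spark documents {{zipWithUniqueId}} as giving items in the k-th of n partitions the ids k, n+k, 2n+k, ... (unique but not consecutive). A plain-Python model of that scheme, not the real API:

```python
# Model of RDD.zipWithUniqueId: with n partitions, the j-th item of
# partition i gets id i + j*n -- ids are unique but have gaps, which
# is why "continuous" ids are not expected from this operation.
def zip_with_unique_id(partitions):
    n = len(partitions)
    return [(x, i + j * n)
            for i, part in enumerate(partitions)
            for j, x in enumerate(part)]

ids = dict(zip_with_unique_id([["a", "b"], ["c"], ["d", "e"]]))
# ids == {"a": 0, "b": 3, "c": 1, "d": 2, "e": 5}
```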
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of resources in Connection
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Summary: *Race Condition Issue* Fix the order of resources in Connection (was: Suppress unwilling Exception and error messages caused by SendingConnection) *Race Condition Issue* Fix the order of resources in Connection --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the read behavior of SendingConnection. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of resources in Connection
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Description: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in was: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the read behavior of SendingConnection. *Race Condition Issue* Fix the order of resources in Connection --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of closing resources when Connection is closed
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Description: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by a race condition at the time the Connection is closed. was: Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in *Race Condition Issue* Fix the order of closing resources when Connection is closed --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by a race condition at the time the Connection is closed. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3106) *Race Condition Issue* Fix the order of closing resources when Connection is closed
[ https://issues.apache.org/jira/browse/SPARK-3106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3106: -- Summary: *Race Condition Issue* Fix the order of closing resources when Connection is closed (was: *Race Condition Issue* Fix the order of resources in Connection) *Race Condition Issue* Fix the order of closing resources when Connection is closed --- Key: SPARK-3106 URL: https://issues.apache.org/jira/browse/SPARK-3106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Currently, when we run a Spark application, error messages appear in the driver's log, including the following: * messages caused by ClosedChannelException * messages caused by CancelledKeyException * Corresponding SendingConnectionManagerId not found These are mainly caused by the race condition issue in -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3115) Improve task broadcast latency for small tasks
[ https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102142#comment-14102142 ] Mridul Muralidharan commented on SPARK-3115: I had a tab open with pretty much the exact same bug comments ready to be filed :-) Improve task broadcast latency for small tasks -- Key: SPARK-3115 URL: https://issues.apache.org/jira/browse/SPARK-3115 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Shivaram Venkataraman Assignee: Reynold Xin Broadcasting the task information helps reduce the amount of data transferred for large tasks. However, we've seen that this adds more latency for small tasks. It'll be great to profile and fix this. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102145#comment-14102145 ] Sean Owen commented on SPARK-3098: -- Yes, I get the same result with Spark 1.0.0 with patches, including the fix for SPARK-2043, in standalone mode: {code} Array[(Int, (Long, Long))] = Array((9272040,(13,14)), (9985320,(14,13)), (32797680,(24,26))) {code} If I change the code above so that the ranges are not overlapping to begin with, and remove distinct(), I don't see the issue. It also goes away if the RDD c is cached. I would assume distinct() is deterministic, even if it doesn't guarantee an ordering. Same with zipWithIndex(). Either those assumptions are wrong, or it could be an issue in either place. A quick check says most keys are correct (no mismatch), and the mismatch is generally small. This makes me wonder if there's some kind of race condition in handing out numbers? I'll look at the code too. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical I do not know how to reproduce the bug. This is the case: when I was processing 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val peferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3098: --- Description: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} was: I do not know how to reproduce the bug. This is the case. When I was operating on 10 billion records with groupByKey, the results were wrong: {noformat} (4696501, 370568) (4696501, 376672) (4696501, 374880) . (4696502, 350264) (4696502, 358458) (4696502, 398502) .. {noformat} => {noformat} (4696501,ArrayBuffer(350264, 358458, 398502 )), (4696502,ArrayBuffer(376621, ..)) {noformat} code: {code} val dealOuts = clickPreferences(sc, dealOutPath, periodTime) val dealOrders = orderPreferences(sc, dealOrderPath, periodTime) val favorites = favoritePreferences(sc, favoritePath, periodTime) val allBehaviors = (dealOrders ++ favorites ++ dealOuts) val peferences = allBehaviors.groupByKey().map { ... } {code} spark-defaults.conf: {code} spark.default.parallelism 280 {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3098: --- Description: The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} was: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3099) Staging Directory is never deleed when we run job with YARN Client Mode
[ https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3099: -- Summary: Staging Directory is never deleed when we run job with YARN Client Mode (was: Add a shutdown hook for cleanup staging directory to ExecutorLauncher) Staging Directory is never deleed when we run job with YARN Client Mode --- Key: SPARK-3099 URL: https://issues.apache.org/jira/browse/SPARK-3099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we run an application in YARN Cluster mode, the class 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown hook to clean up the staging directory (~/.sparkStaging). But when we run an application in YARN Client mode, the class 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3101) Missing volatile annotation in ApplicationMaster
[ https://issues.apache.org/jira/browse/SPARK-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3101: -- Summary: Missing volatile annotation in ApplicationMaster (was: Flag variable in ApplicationMaster should be declared as volatile) Missing volatile annotation in ApplicationMaster Key: SPARK-3101 URL: https://issues.apache.org/jira/browse/SPARK-3101 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta In ApplicationMaster, a field variable 'isLastAMRetry' is used as a flag but it's not declared as volatile though it's used from multiple threads. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
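The fix amounts to marking the flag volatile so a write by one thread is visible to the others. A minimal sketch (not the actual ApplicationMaster code; names are illustrative):

```scala
object AMState {
  // Without @volatile a reader thread may keep seeing a stale cached value,
  // since the flag is written and read from different threads.
  @volatile private var isLastAMRetry: Boolean = true

  def markLastRetry(last: Boolean): Unit = { isLastAMRetry = last } // writer thread
  def shouldCleanup: Boolean = isLastAMRetry                        // reader thread, e.g. a shutdown hook
}
```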
[jira] [Updated] (SPARK-3099) Staging Directory is never deleted when we run job with YARN Client Mode
[ https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3099: -- Summary: Staging Directory is never deleted when we run job with YARN Client Mode (was: Staging Directory is never deleed when we run job with YARN Client Mode) Staging Directory is never deleted when we run job with YARN Client Mode Key: SPARK-3099 URL: https://issues.apache.org/jira/browse/SPARK-3099 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we run an application in YARN Cluster mode, the class 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown hook to clean up the staging directory (~/.sparkStaging). But when we run an application in YARN Client mode, the class 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging directory. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
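A sketch of the direction the original summary proposed (a shutdown hook in the client-mode AM); the method name and FileSystem handling are illustrative, not the actual patch:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: delete the application's staging directory
// (under ~/.sparkStaging) when the AM JVM exits, mirroring what the
// cluster-mode ApplicationMaster already does.
def registerStagingCleanupHook(fs: FileSystem, stagingDir: Path): Unit =
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = fs.delete(stagingDir, true) // recursive delete
  })
```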
[jira] [Updated] (SPARK-3090) Avoid not stopping SparkContext with YARN Client mode
[ https://issues.apache.org/jira/browse/SPARK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3090: -- Summary: Avoid not stopping SparkContext with YARN Client mode (was: Add shutdown hook to stop SparkContext for YARN Client mode) Avoid not stopping SparkContext with YARN Client mode -- Key: SPARK-3090 URL: https://issues.apache.org/jira/browse/SPARK-3090 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Kousuke Saruta When we use YARN Cluster mode, the ApplicationMaster registers a shutdown hook that stops the SparkContext. Thanks to this, the SparkContext can be stopped even if the application forgets to stop it itself. But, unfortunately, YARN Client mode doesn't have such a mechanism. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
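The cluster-mode mechanism described above can be sketched as follows; client mode would need something equivalent (illustrative, not the actual ApplicationMaster code):

```scala
import org.apache.spark.SparkContext

// Ensure the SparkContext is stopped on JVM exit even if the
// application never calls sc.stop() itself.
def registerStopHook(sc: SparkContext): Unit =
  Runtime.getRuntime.addShutdownHook(new Thread {
    override def run(): Unit = sc.stop()
  })
```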
[jira] [Updated] (SPARK-3089) Fix meaningless error message in ConnectionManager
[ https://issues.apache.org/jira/browse/SPARK-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3089: -- Summary: Fix meaningless error message in ConnectionManager (was: Make error message in ConnectionManager more meaningful) Fix meaningless error message in ConnectionManager -- Key: SPARK-3089 URL: https://issues.apache.org/jira/browse/SPARK-3089 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta When ConnectionManager#removeConnection is invoked and it cannot find the SendingConnection to be closed corresponding to a ConnectionManagerId, the following message is logged. {code} logError("Corresponding SendingConnectionManagerId not found") {code} But we cannot tell from the message which SendingConnectionManagerId is meant. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
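The fix the title suggests is simply to include the id in the message. A sketch (the variable name is assumed, not taken from the actual patch):

```scala
// Include which id failed to resolve so the log line is actionable.
logError("Corresponding SendingConnectionManagerId not found: " + sendingConnectionManagerId)
```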
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102179#comment-14102179 ] Tarek Elgamal commented on SPARK-1782: -- I am interested in trying this new svd implementation. Is there an estimate of when Spark 1.1.0 will be officially released? svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Fix For: 1.1.0 Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and return the resulting dense vector to ARPACK. Inside ARPACK it uses an Implicitly Restarted Lanczos Method for symmetric matrix. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can be easily fit into the memory of the master machine. 
The overall model is: the master machine runs ARPACK, and the matrix-vector multiplications are distributed onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
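The two distributed multiplications described above (first M*x, then M^t applied to the result) could be sketched over an RDD of sparse rows like this. This is illustrative only; the row representation and names are assumptions, not MLlib's actual API:

```scala
import org.apache.spark.rdd.RDD

// rows: each element is one sparse row of M as (columnIndex, value) pairs.
// x: the dense vector supplied by ARPACK on the master.
// Returns y = M^t * (M * x), computed in one distributed pass.
def multiplyGramian(rows: RDD[Array[(Int, Double)]], x: Array[Double]): Array[Double] =
  rows.map { row =>
    // local dot product of this sparse row with x: one entry of M * x
    val mx = row.map { case (j, v) => v * x(j) }.sum
    // this row's contribution to M^t * (M * x)
    val contrib = new Array[Double](x.length)
    row.foreach { case (j, v) => contrib(j) += v * mx }
    contrib
  }.reduce { (a, b) =>
    // elementwise sum of the per-row contributions
    a.indices.map(i => a(i) + b(i)).toArray
  }
```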
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102187#comment-14102187 ] Guoqiang Li commented on SPARK-3098: [~srowen] the following code also has this issue. {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct() c.zip(c).filter(t => t._1 != t._2).take(3) {code} {{distinct}} seems to be the problem. In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
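Sean Owen observed earlier in this thread that the mismatch goes away when the RDD is cached; a workaround sketch along those lines:

```scala
// Persist the distinct() output before zipping, so both sides of the
// join evaluate one materialized ordering rather than re-running the
// shuffle fetch (whose request order is randomized) twice.
val c = sc.parallelize(1 to 7899).flatMap { i =>
  (1 to 1).toSeq.map(p => i * 6000 + p)
}.distinct().cache()

val indexed = c.zipWithIndex()
// empty once c is cached, per the observation above
indexed.join(indexed).filter(t => t._2._1 != t._2._2).take(3)
```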
[jira] [Closed] (SPARK-2753) Is it supposed --archives option in yarn cluster mode to uncompress file?
[ https://issues.apache.org/jira/browse/SPARK-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] José Manuel Abuín Mosquera closed SPARK-2753. - Resolution: Not a Problem Is it supposed --archives option in yarn cluster mode to uncompress file? - Key: SPARK-2753 URL: https://issues.apache.org/jira/browse/SPARK-2753 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Environment: CentOS release 6.5 (64 bits) and Hadoop 2.2.0 Reporter: José Manuel Abuín Mosquera Labels: archives, cache, distributed, yarn Hi all, this is my first filed issue; I googled and searched into the Spark code and arrived here. When passing a tar.gz or .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it does not uncompress it. According to the documentation, it is supposed to uncompress it; is this a bug? The launch command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark The files /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala and /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala don't seem to uncompress the files. I hope this helps, thank you very much :) -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3120) Local Dirs is not useful in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102199#comment-14102199 ] Thomas Graves commented on SPARK-3120: -- Can you please clarify this? You are trying to set the local-dirs on the executors? If so, then you are specifically not allowed to do that. On YARN you should be using the YARN-approved directories, which get properly managed by YARN (cleaned up when the application finishes). It automatically just picks up the YARN-configured directories. Local Dirs is not useful in yarn-client mode Key: SPARK-3120 URL: https://issues.apache.org/jira/browse/SPARK-3120 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Environment: Spark 1.0.2 Yarn 2.3.0 Reporter: hzw I was using Spark 1.0.2 and Hadoop 2.3.0 to run a Spark application on YARN. I expected to set spark.local.dir to spread the shuffle files across many disks, so I exported LOCAL_DIRS in spark-env.sh. But it failed to create the local dirs in my specified path; it just went to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/, the Hadoop default path. To reproduce this: 1. Do not set "yarn.nodemanager.local-dirs" in yarn-site.xml, which influences the result. 2. Run a job and then find in the executor log the line INFO DiskBlockManager: Created local directory at .. In addition, I tried to add the exported LOCAL_DIRS in yarn-env.sh. It launches with the LOCAL_DIRS value in the ExecutorLauncher, but it is still overwritten by YARN when launching the executor container. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
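Per the comment above, executor local directories on YARN come from the YARN configuration rather than spark.local.dir or LOCAL_DIRS. A yarn-site.xml fragment along those lines (the paths are illustrative):

```xml
<!-- yarn-site.xml on each NodeManager: Spark on YARN inherits
     executor local dirs from YARN itself, which also cleans them up
     when the application finishes. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/disk1/yarn/local,/disk2/yarn/local</value>
</property>
```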
[jira] [Resolved] (SPARK-3072) Yarn AM not always properly exiting after unregistering from RM
[ https://issues.apache.org/jira/browse/SPARK-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3072. -- Resolution: Fixed Fix Version/s: 1.1.0 Yarn AM not always properly exiting after unregistering from RM --- Key: SPARK-3072 URL: https://issues.apache.org/jira/browse/SPARK-3072 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical Fix For: 1.1.0 The yarn application master doesn't always exit properly after unregistering from the RM. One way to reproduce is to ask for large containers (> 4g) but use jdk32 so that all of them fail. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102279#comment-14102279 ] Guoqiang Li commented on SPARK-3098: This issue is caused by this code: [BlockFetcherIterator.scala#L221|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala#L221] {noformat}fetchRequests ++= Utils.randomize(remoteRequests){noformat} <= [ShuffledRDD.scala#L65|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala#L65] {noformat}SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser){noformat} <= [PairRDDFunctions.scala#L100|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L100] {noformat} val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner) .setSerializer(serializer) partitioned.mapPartitionsWithContext((context, iter) => { new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context)) }, preservesPartitioning = true) {noformat} <= [PairRDDFunctions.scala#L163|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L163] {noformat} def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = { combineByKey[V]((v: V) => v, func, func, partitioner) } {noformat} <= [RDD.scala#L288|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L288] {noformat} def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1) {noformat} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The code to reproduce: {code} val c = sc.parallelize(1 to 7899).flatMap { i => (1 to 1).toSeq.map(p => i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t => t._2._1 != t._2._2).take(3) {code} => {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3098) In some cases, operation zipWithIndex get a wrong results
[ https://issues.apache.org/jira/browse/SPARK-3098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102279#comment-14102279 ] Guoqiang Li edited comment on SPARK-3098 at 8/19/14 3:02 PM: - this issue caused by the code: [BlockFetcherIterator.scala#L221|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala#L221] {noformat}fetchRequests ++= Utils.randomize(remoteRequests){noformat} =[ShuffledRDD.scala#L65|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala#L65]{noformat}SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser){noformat} = [PairRDDFunctions.scala#L100|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L100] {noformat} val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner) .setSerializer(serializer) partitioned.mapPartitionsWithContext((context, iter) = { new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context)) }, preservesPartitioning = true) {noformat} = [PairRDDFunctions.scala#L163|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L163] {noformat} def reduceByKey(partitioner: Partitioner, func: (V, V) = V): RDD[(K, V)] = { combineByKey[V]((v: V) = v, func, func, partitioner) } {noformat} = [RDD.scala#L288|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L288] {noformat} def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x = (x, null)).reduceByKey((x, y) = x, numPartitions).map(_._1) {noformat} was (Author: gq): this issue caused by the code: [BlockFetcherIterator.scala#L221|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/storage/BlockFetcherIterator.scala#L221] {noformat}fetchRequests ++= 
Utils.randomize(remoteRequests){noformat} =[ShuffledRDD.scala#L65|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/ShuffledRDD.scala#L65]{noformat}SparkEnv.get.shuffleFetcher.fetch[P](shuffledId, split.index, context, ser){noformat} = [PairRDDFunctions.scala#L100|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L100] {noformat} val partitioned = new ShuffledRDD[K, C, (K, C)](combined, partitioner) .setSerializer(serializer) partitioned.mapPartitionsWithContext((context, iter) = { new InterruptibleIterator(context, aggregator.combineCombinersByKey(iter, context)) }, preservesPartitioning = true) {noformat} = [PairRDDFunctions.scala#L163|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L163] {noformat} def reduceByKey(partitioner: Partitioner, func: (V, V) = V): RDD[(K, V)] = { combineByKey[V]((v: V) = v, func, func, partitioner) } {noformat} = [RDD.scala#L288|https://github.com/apache/spark/blob/v1.0.1/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L288] {noformat} def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = map(x = (x, null)).reduceByKey((x, y) = x, numPartitions).map(_._1) {noformat} In some cases, operation zipWithIndex get a wrong results -- Key: SPARK-3098 URL: https://issues.apache.org/jira/browse/SPARK-3098 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Guoqiang Li Priority: Critical The reproduce code: {code} val c = sc.parallelize(1 to 7899).flatMap { i = (1 to 1).toSeq.map(p = i * 6000 + p) }.distinct().zipWithIndex() c.join(c).filter(t = t._2._1 != t._2._2).take(3) {code} = {code} Array[(Int, (Long, Long))] = Array((1732608,(11,12)), (45515264,(12,13)), (36579712,(13,14))) {code} -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
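The call chain quoted above ends at {{distinct()}}, whose shuffle makes the parent RDD's contents nondeterministically ordered across evaluations, while {{zipWithIndex()}} derives each element's index from per-partition counts and offsets. A plain-Python sketch of that indexing scheme (illustrative only, not Spark's API; the helper name is made up) shows why re-evaluating a nondeterministic parent can hand the same element two different indices:

```python
# Sketch (plain Python, not Spark code): zipWithIndex first counts the
# elements in each partition, then gives partition i the starting offset
# sum(counts[:i]), and numbers elements by their position in the partition.
from itertools import accumulate

def zip_with_index(partitions):
    counts = [len(p) for p in partitions]
    offsets = [0] + list(accumulate(counts))[:-1]
    return [(x, off + i)
            for part, off in zip(partitions, offsets)
            for i, x in enumerate(part)]

# Two evaluations of a nondeterministic upstream (e.g. distinct()):
# same elements, but a different order within the first partition.
run1 = [[10, 20], [30, 40]]
run2 = [[20, 10], [30, 40]]
idx1 = dict(zip_with_index(run1))
idx2 = dict(zip_with_index(run2))
# Element 10 gets one index in the first evaluation and a different one
# in the second, so a self-join can observe mismatched index pairs.
```

In the reproduction above, the self-join evaluates the lineage twice, which is how elements whose position shifted between evaluations surface as pairs with unequal indices.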
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102288#comment-14102288 ] Guoqiang Li commented on SPARK-3122: Why add the {{spark-yarn_2.10}} dependency? Normally you add this to your POM file's dependencies section: {code:xml} <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.4.0</version> </dependency> {code} hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding spark-yarn_2.10:1.0.2 dependency to java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded using version 1.0.4 which does not exist, resulting in build error. Version 1.0.4 is taken from hadoop.version variable in spark-parent-1.0.2.pom. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102300#comment-14102300 ] Ran Levi commented on SPARK-3122: - It was my understanding that it is required to create a JavaSparkContext with master='yarn-client'. Was I wrong? hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding spark-yarn_2.10:1.0.2 dependency to java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded using version 1.0.4 which does not exist, resulting in build error. Version 1.0.4 is taken from hadoop.version variable in spark-parent-1.0.2.pom. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
uncleGen created SPARK-3123: --- Summary: override the setName function to set EdgeRDD's name manually just as VertexRDD does. Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.2, 1.0.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3123) override the setName function to set EdgeRDD's name manually just as VertexRDD does.
[ https://issues.apache.org/jira/browse/SPARK-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102308#comment-14102308 ] Apache Spark commented on SPARK-3123: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/2033 override the setName function to set EdgeRDD's name manually just as VertexRDD does. -- Key: SPARK-3123 URL: https://issues.apache.org/jira/browse/SPARK-3123 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.0.0, 1.0.2 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3118) add SHOW TBLPROPERTIES tblname; and SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] support
[ https://issues.apache.org/jira/browse/SPARK-3118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102317#comment-14102317 ] Apache Spark commented on SPARK-3118: - User 'u0jing' has created a pull request for this issue: https://github.com/apache/spark/pull/2034 add SHOW TBLPROPERTIES tblname; and SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] support - Key: SPARK-3118 URL: https://issues.apache.org/jira/browse/SPARK-3118 Project: Spark Issue Type: New Feature Components: SQL Reporter: wangxiaojing Priority: Minor Original Estimate: 12h Remaining Estimate: 12h The SHOW TBLPROPERTIES tblname; and SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] syntax had been disabled. SHOW COLUMNS shows all the columns in a table including partition columns. SHOW TBLPROPERTIES shows Table Properties. They all describe a hive table. spark-sql SHOW COLUMNS in test; SHOW COLUMNS in test; java.lang.RuntimeException: Unsupported language features in query: SHOW COLUMNS in test TOK_SHOWCOLUMNS TOK_TABNAME test spark-sql SHOW TBLPROPERTIES test; SHOW TBLPROPERTIES test; java.lang.RuntimeException: Unsupported language features in query: SHOW TBLPROPERTIES test TOK_SHOW_TBLPROPERTIES test -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102318#comment-14102318 ] Guoqiang Li commented on SPARK-3122: It is not necessary. You only need these: {code:xml} <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.4.0</version> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>2.4.0</version> </dependency> {code} Refer to these: http://spark.apache.org/docs/latest/running-on-yarn.html http://spark.apache.org/docs/latest/submitting-applications.html https://github.com/apache/spark/blob/master/README.md#a-note-about-hadoop-versions hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding spark-yarn_2.10:1.0.2 dependency to java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded using version 1.0.4 which does not exist, resulting in build error. Version 1.0.4 is taken from hadoop.version variable in spark-parent-1.0.2.pom. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102330#comment-14102330 ] Guoqiang Li commented on SPARK-3124: What's your command? Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102330#comment-14102330 ] Guoqiang Li edited comment on SPARK-3124 at 8/19/14 3:42 PM: - What's your command? {{ ./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive -Dhadoop.version=2.3.0 }} should be no problem. was (Author: gq): What's your command? Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102334#comment-14102334 ] Apache Spark commented on SPARK-3124: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/2035 Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102338#comment-14102338 ] Cheng Hao commented on SPARK-3124: -- Can you try bin/spark-sql after make distribution? Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3125) hive thriftserver test suite failure
wangfei created SPARK-3125: -- Summary: hive thriftserver test suite failure Key: SPARK-3125 URL: https://issues.apache.org/jira/browse/SPARK-3125 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: wangfei hive thriftserver test suite failure 1 CliSuite: [info] - simple commands *** FAILED *** [info] java.lang.AssertionError: assertion failed: Didn't find OK in the output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:54) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... 
2.HiveThriftServer2Suite - test query execution against a Hive Thrift server *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:41419/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:152) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:155) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply$mcV$sp(HiveThriftServer2Suite.scala:113) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:110) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite$$anonfun$1.apply(HiveThriftServer2Suite.scala:110) [info] ... 
[info] Cause: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused [info] at org.apache.thrift.transport.TSocket.open(TSocket.java:185) [info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248) [info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at java.sql.DriverManager.getConnection(DriverManager.java:571) [info] at java.sql.DriverManager.getConnection(DriverManager.java:215) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.getConnection(HiveThriftServer2Suite.scala:152) [info] at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2Suite.createStatement(HiveThriftServer2Suite.scala:155) [info] ... [info] Cause: java.net.ConnectException: Connection refused [info] at java.net.PlainSocketImpl.socketConnect(Native Method) [info] at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) [info] at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) [info] at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) [info] at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) [info] at java.net.Socket.connect(Socket.java:579) [info] at org.apache.thrift.transport.TSocket.open(TSocket.java:180) [info] at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:248) [info] at org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37) [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:144) [info] ... 
[info] - SPARK-3004 regression: result set containing NULL *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:41419/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info]
[jira] [Created] (SPARK-3126) HiveThriftServer2Suite hangs
Cheng Lian created SPARK-3126: - Summary: HiveThriftServer2Suite hangs Key: SPARK-3126 URL: https://issues.apache.org/jira/browse/SPARK-3126 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Priority: Blocker Fix For: 1.0.1, 1.0.2 [PR #1851|https://github.com/apache/spark/pull/1851] modified {{sbin/start-thriftserver.sh}}, added proper quotation and removed {{eval}}, but {{HiveThriftServer2Suite}}, which invokes {{sbin/start-thriftserver.sh}}, was not updated accordingly. The JDBC URL command line option shouldn't be quoted after removing {{eval}} from the script; otherwise, the following wrong command is issued (notice the unclosed double quote): {code} ../../sbin/start-thriftserver.sh ... --hiveconf "javax.jdo.option.ConnectionURL=xxx ... {code} This makes the test cases hang until they time out. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
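The quoting mismatch can be seen with a toy shell snippet (illustrative only, not the actual start-thriftserver.sh logic): when a variable holds a pre-quoted option string, expanding it without {{eval}} leaves the double quotes as literal characters inside the argument, while {{eval}} re-parses and strips them:

```shell
# Hypothetical option string, mimicking the pre-quoted JDBC URL option.
args='--hiveconf "javax.jdo.option.ConnectionURL=xxx"'

# Without eval: plain word splitting keeps the quote characters.
set -- $args
no_eval_arg=$2   # -> "javax.jdo.option.ConnectionURL=xxx" (quotes included)

# With eval: the string is re-parsed, so the quotes are removed as intended.
eval "set -- $args"
eval_arg=$2      # -> javax.jdo.option.ConnectionURL=xxx
```

This is why, once the script stopped using {{eval}}, the test suite also had to stop pre-quoting the option.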
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102344#comment-14102344 ] Guoqiang Li commented on SPARK-3124: We should modify the file sql/hive-thriftserver/pom.xml: {code:xml} <dependency> <groupId>org.spark-project.hive</groupId> <artifactId>hive-cli</artifactId> <version>${hive.version}</version> <exclusions> <exclusion> <groupId>org.jboss.netty</groupId> <artifactId>netty</artifactId> </exclusion> </exclusions> </dependency> {code} Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flattened into the assembly package; however, the class (NioWorker) signature difference leads to the failure in launching the sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3127) Modifying Spark SQL related scripts should trigger Spark SQL test suites
Cheng Lian created SPARK-3127: - Summary: Modifying Spark SQL related scripts should trigger Spark SQL test suites Key: SPARK-3127 URL: https://issues.apache.org/jira/browse/SPARK-3127 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.0.1 Reporter: Cheng Lian Currently only modifying files under {{sql/}} triggers execution of Spark SQL test suites, {{bin/spark-sql}} and {{sbin/start-thriftserver.sh}} are not included. This is an indirect cause of SPARK-3126. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3120) Local Dirs is not useful in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102350#comment-14102350 ] hzw commented on SPARK-3120: Do you mean that if I want to change the local dirs in YARN mode, I must restart the cluster, since the configuration was cached in memory? And I was also confused about where to set LOCAL_DIRS, as the Spark configuration says: "NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager." Local Dirs is not useful in yarn-client mode Key: SPARK-3120 URL: https://issues.apache.org/jira/browse/SPARK-3120 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.0.2 Environment: Spark 1.0.2 Yarn 2.3.0 Reporter: hzw I was using Spark 1.0.2 and Hadoop 2.3.0 to run a Spark application on YARN. I expected to set spark.local.dir to spread the shuffle files across many disks, so I exported LOCAL_DIRS in spark-env.sh. But it failed to create the local dirs in my specified path; it just went to /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/, the Hadoop default path. To reproduce this: 1. Do not set "yarn.nodemanager.local-dirs" in yarn-site.xml, which influences the result. 2. Run a job and then find in the executor log the line INFO DiskBlockManager: Created local directory at .. In addition, I tried to add the exported LOCAL_DIRS in yarn-env.sh. It launches the LOCAL_DIRS value in the ExecutorLauncher, but it would still be overwritten by YARN when launching the executor container. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
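For reference, the setting under discussion looks like this in {{spark-env.sh}} (a hypothetical fragment; the disk paths are made up). As the quoted note says, on YARN the cluster manager's LOCAL_DIRS wins, so this value only takes effect for Standalone and Mesos deployments:

```shell
# spark-env.sh (illustrative only): spread shuffle files across two disks.
# On YARN this is overridden by the LOCAL_DIRS environment variable that the
# NodeManager derives from yarn.nodemanager.local-dirs, which is why
# exporting it here has no effect on executor containers.
export SPARK_LOCAL_DIRS=/disk1/spark,/disk2/spark
```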
[jira] [Commented] (SPARK-3124) Jar version conflict in the assembly package
[ https://issues.apache.org/jira/browse/SPARK-3124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102352#comment-14102352 ] Cheng Hao commented on SPARK-3124: -- Yes, actually I did in the PR. Jar version conflict in the assembly package Key: SPARK-3124 URL: https://issues.apache.org/jira/browse/SPARK-3124 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Blocker Both netty-3.2.2.Final.jar and netty-3.6.6.Final.jar are flatten into the assembly package, however, the class(NioWorker) signature difference leads to the failure in launching sparksql CLI/ThriftServer. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2929) Rewrite HiveThriftServer2Suite and CliSuite
[ https://issues.apache.org/jira/browse/SPARK-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102370#comment-14102370 ] Cheng Lian commented on SPARK-2929: --- Opened SPARK-3126 and SPARK-3127 to track failures of these test suites more precisely. [PR #2036|https://github.com/apache/spark/pull/2036] was submitted to fix both issues. Set SPARK-3126 as blocker and restored this one to major. Rewrite HiveThriftServer2Suite and CliSuite --- Key: SPARK-2929 URL: https://issues.apache.org/jira/browse/SPARK-2929 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian {{HiveThriftServer2Suite}} and {{CliSuite}} were inherited from Shark and contain too many hard-coded timeouts and timing assumptions when doing IPC. This makes these tests both flaky and slow. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3127) Modifying Spark SQL related scripts should trigger Spark SQL test suites
[ https://issues.apache.org/jira/browse/SPARK-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102373#comment-14102373 ] Apache Spark commented on SPARK-3127: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/2036 Modifying Spark SQL related scripts should trigger Spark SQL test suites Key: SPARK-3127 URL: https://issues.apache.org/jira/browse/SPARK-3127 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Currently only modifying files under {{sql/}} triggers execution of Spark SQL test suites, {{bin/spark-sql}} and {{sbin/start-thriftserver.sh}} are not included. This is an indirect cause of SPARK-3126. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3125) hive thriftserver test suite failure
[ https://issues.apache.org/jira/browse/SPARK-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102424#comment-14102424 ] wangfei commented on SPARK-3125: for clisuite i print the error info, as follows: log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Logging initialized using configuration in jar:file:/home/wf/code/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar!/hive-log4j.properties FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:301) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:271) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35) at org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:359) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:359) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:103) at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:58) at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:291) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:413) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:226) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:314) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) hive thriftserver test suite failure Key: SPARK-3125 URL: https://issues.apache.org/jira/browse/SPARK-3125 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: wangfei hive thriftserver test suite failure 1 CliSuite: [info] - simple commands *** FAILED *** [info] java.lang.AssertionError: assertion failed: Didn't find OK in the output: [info] at scala.Predef$.assert(Predef.scala:179) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.waitForQuery(TestUtils.scala:70) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.waitForQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.TestUtils$class.executeQuery(TestUtils.scala:62) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.executeQuery(CliSuite.scala:26) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply$mcV$sp(CliSuite.scala:54) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$1.apply(CliSuite.scala:52) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) 
[info] at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) [info] ... 2.HiveThriftServer2Suite - test query execution against a Hive Thrift server *** FAILED *** [info] java.sql.SQLException: Could not open connection to jdbc:hive2://localhost:41419/: java.net.ConnectException: Connection refused [info] at org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:146) [info] at org.apache.hive.jdbc.HiveConnection.init(HiveConnection.java:123) [info] at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105) [info] at
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14102427#comment-14102427 ] Xiangrui Meng commented on SPARK-1782: -- The plan is to release v1.1 by the end of the month. The feature is available in both master and branch-1.1. You can also check out the current snapshot and try it. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Fix For: 1.1.0 Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function that multiplies the square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it easily fits into the memory of the master machine. 
The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications to worker executors in each iteration. I made a PR to Breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/MLlib side, we just need to implement the sparse matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so the workload estimate is set to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
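The reverse-communication loop described above can be sketched in plain Python (a toy stand-in: a simple power iteration replaces ARPACK's implicitly restarted Lanczos method, and a small dense matrix replaces the distributed sparse matrix; all names are made up). The point is that the solver only ever calls back for v -> M^t*(M*v), and those two matrix-vector products are exactly what Spark would distribute:

```python
# Toy sketch of the reverse-communication idea behind the proposal above.
# The eigensolver never sees the matrix; it only calls gram_matvec, which
# chains the two products M*x and M^t*x (each a distributed job in Spark).
import math
import random

def matvec(M, v):
    # Dense stand-in for a distributed M * x over partitioned rows.
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def gram_matvec(M, v):
    # v -> M^t * (M * v), without ever materializing the Gram matrix M^t*M.
    Mv = matvec(M, v)
    return [sum(M[i][j] * Mv[i] for i in range(len(M))) for j in range(len(v))]

def top_singular_value(M, iters=200):
    # Power iteration on M^t*M; its largest eigenvalue is sigma_max^2.
    random.seed(0)
    v = [random.random() for _ in range(len(M[0]))]
    for _ in range(iters):
        w = gram_matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged vector gives the top eigenvalue.
    lam = sum(a * b for a, b in zip(v, gram_matvec(M, v)))
    return math.sqrt(lam)
```

Only the two matvec helpers touch the matrix data; the O(n*k) solver state stays on the driver, matching the memory argument in the issue description.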
[jira] [Created] (SPARK-3128) Use streaming test suite for StreamingLR
Jeremy Freeman created SPARK-3128: - Summary: Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-3129: Issue Type: New Feature (was: Bug) Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Updated] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-3129: Attachment: StreamingPreventDataLoss.pdf Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3122) hadoop-yarn dependencies cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102472#comment-14102472 ] Sean Owen commented on SPARK-3122: -- [~gq] You do not need to depend on hadoop-client for Spark's sake. You may need to depend on it because your code uses Hadoop directly. Again, if you include hadoop-client, it should be provided. Your example shows a compile-time dependency, and duplicates a single dependency. [~ran.l...@hp.com] spark-yarn is really the 'server side' implementation. You depend on spark-core. It should be a provided dependency. The Spark artifacts published to Maven happen to reference Hadoop 1.x, but that will not matter for your build, because you do not inherit their dependencies. All you care about is Spark's API, which does not vary with the Hadoop version. hadoop-yarn dependencies cannot be resolved --- Key: SPARK-3122 URL: https://issues.apache.org/jira/browse/SPARK-3122 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0, 1.0.2 Environment: spark 1.0.1 + YARN + hadoop 2.4.0 cluster on linux machines client on windows 7 maven 3.0.4 maven repository http://repo1.maven.org/maven2/org/apache/hadoop/ Reporter: Ran Levi Labels: build, easyfix, hadoop, maven When adding the spark-yarn_2.10:1.0.2 dependency to a Java project, other hadoop-yarn-XXX dependencies are needed. Those dependencies are downloaded with version 1.0.4, which does not exist, resulting in a build error. Version 1.0.4 is taken from the hadoop.version variable in spark-parent-1.0.2.pom.
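The dependency setup Sean describes might look like the following pom.xml fragment. This is an illustrative sketch only; the versions shown are examples, and the point is the {code}provided{code} scope, which keeps both artifacts off the runtime classpath so the cluster's own versions are used.

```xml
<!-- Illustrative fragment: Spark and (if your code uses Hadoop directly)
     hadoop-client declared as "provided"; versions are examples only. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.0.2</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.4.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
```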
[jira] [Created] (SPARK-3130) Should not allow negative values in naive Bayes
Xiangrui Meng created SPARK-3130: Summary: Should not allow negative values in naive Bayes Key: SPARK-3130 URL: https://issues.apache.org/jira/browse/SPARK-3130 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should not allow negative feature values, because NB treats feature values as term frequencies.
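The check the issue calls for can be sketched as follows. This is an illustrative sketch, not Spark's API (the function name is hypothetical): multinomial naive Bayes interprets feature values as term frequencies, so a negative value is meaningless and should fail fast rather than silently corrupt the model.

```python
import numpy as np

def validate_nonnegative(features):
    """Reject negative feature values before fitting naive Bayes."""
    features = np.asarray(features, dtype=float)
    if (features < 0).any():
        # Fail fast with a descriptive message instead of training on bad input.
        raise ValueError("naive Bayes requires nonnegative feature values, "
                         f"got min {features.min()}")
    return features
```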
[jira] [Updated] (SPARK-3110) Add a ha mode in YARN mode to keep executors in between restarts
[ https://issues.apache.org/jira/browse/SPARK-3110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hari Shreedharan updated SPARK-3110: Issue Type: Sub-task (was: Bug) Parent: SPARK-3129 Add a ha mode in YARN mode to keep executors in between restarts -- Key: SPARK-3110 URL: https://issues.apache.org/jira/browse/SPARK-3110 Project: Spark Issue Type: Sub-task Reporter: Hari Shreedharan The idea is that for long-running processes like streaming, you'd want the AM to come back up and reuse the same executors, so it can get the blocks from the memory of the executors, because many streaming systems like Flume cannot really replay the data once it has been taken out. Even for those which can, the time period before data expires can mean some data could be lost. This is the first step in a series of patches for this one. The next is to get the AM to find the executors. My current plan is to use HDFS to keep track of where the executors are running and then communicate with them via Akka to get a block list. I plan to expose this via SparkSubmit as the last step, once we have all of the other pieces in place.
[jira] [Commented] (SPARK-3128) Use streaming test suite for StreamingLR
[ https://issues.apache.org/jira/browse/SPARK-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102481#comment-14102481 ] Apache Spark commented on SPARK-3128: - User 'freeman-lab' has created a pull request for this issue: https://github.com/apache/spark/pull/2037 Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend.
[jira] [Resolved] (SPARK-3089) Fix meaningless error message in ConnectionManager
[ https://issues.apache.org/jira/browse/SPARK-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-3089. --- Resolution: Fixed Fix Version/s: 1.1.0 Assignee: Kousuke Saruta Fix meaningless error message in ConnectionManager -- Key: SPARK-3089 URL: https://issues.apache.org/jira/browse/SPARK-3089 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.1.0 When ConnectionManager#removeConnection is invoked and it cannot find the SendingConnection to be closed corresponding to a ConnectionManagerId, the following message is logged. {code} logError(Corresponding SendingConnectionManagerId not found) {code} But we cannot tell from the message which SendingConnectionManagerId is meant.
[jira] [Commented] (SPARK-3130) Should not allow negative values in naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102499#comment-14102499 ] Apache Spark commented on SPARK-3130: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/2038 Should not allow negative values in naive Bayes --- Key: SPARK-3130 URL: https://issues.apache.org/jira/browse/SPARK-3130 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should not allow negative feature values, because NB treats feature values as term frequencies.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102501#comment-14102501 ] Hari Shreedharan commented on SPARK-3129: - This doc is an early list of fixes. I may have missed some, and/or there may be better ways to do this. Please post any feedback you have! Thanks! Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102504#comment-14102504 ] Thomas Graves commented on SPARK-3129: -- A couple of random thoughts on this for YARN. YARN added this ability in 2.4.0, and you have to tell it you want it in the application submission context, so you will have to handle other versions of YARN properly where it's not supported. I believe YARN will tell you what nodes you already have containers running on, but you'll have to figure out details about ports, etc. I haven't looked at all the specifics. You'll have to figure out how to do authentication properly; this gets forgotten many times. I think we should flesh out more of the high-level design concerns between YARN/standalone/Mesos, and on YARN the client/cluster modes. Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102579#comment-14102579 ] Hari Shreedharan commented on SPARK-3129: - The way the driver finds the executors would be common for all the scheduling systems (it should really be independent of scheduling/deployment). I agree about the auth part too. [~tdas] mentioned there is something similar already in standalone. I'd like to concentrate on YARN - if someone else is interested in Mesos, please feel free to take it up! I posted an initial patch for client mode to simply keep the executors around (though it is not exposed via SparkSubmit, which we can do once we get the whole series of patches in). For YARN mode, does that mean the method calls have to be via reflection? I'd assume so. The reason I mentioned doing it via HDFS and then pinging the executors is to make it independent of YARN/Mesos/standalone - we can just do it via StreamingContext and make it completely independent of the backend on which Spark is running (I am not even sure this should be a valid option for non-streaming cases, as it does not really add any value elsewhere). Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down - and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details.
[jira] [Updated] (SPARK-3131) Allow user to set parquet compression codec for writing ParquetFile in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Qiu updated SPARK-3131: Summary: Allow user to set parquet compression codec for writing ParquetFile in SQLContext (was: Allow user to set parquet compression codec) Allow user to set parquet compression codec for writing ParquetFile in SQLContext - Key: SPARK-3131 URL: https://issues.apache.org/jira/browse/SPARK-3131 Project: Spark Issue Type: Improvement Components: SQL Reporter: Teng Qiu There are 4 different compression codecs available for ParquetOutputFormat. Currently it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively.
[jira] [Updated] (SPARK-3131) Allow user to set parquet compression codec for writing ParquetFile in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Teng Qiu updated SPARK-3131: Description: There are 4 different compression codecs available for ParquetOutputFormat. In Spark SQL it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively. was: There are 4 different compression codec available for ParquetOutputFormat currently it was set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} original discuss: https://github.com/apache/spark/pull/195#discussion-diff-11002083 so we need to add a new config property in SQLConf to allow user change this compression codec, and i used similar short names syntax as described in SPARK-2953 btw, which codec should we use as default? it was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but i think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports Snappy codec natively. 
Allow user to set parquet compression codec for writing ParquetFile in SQLContext - Key: SPARK-3131 URL: https://issues.apache.org/jira/browse/SPARK-3131 Project: Spark Issue Type: Improvement Components: SQL Reporter: Teng Qiu There are 4 different compression codecs available for ParquetOutputFormat. In Spark SQL it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively.
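The SPARK-2953-style short-name resolution the issue proposes can be sketched as follows. This is an illustrative sketch, not Spark's implementation; the config key below is chosen here only for illustration. The four codec names match parquet-mr's CompressionCodecName values.

```python
# Short names a user would put in the config, mapped to parquet-mr codec names.
PARQUET_CODECS = {
    "uncompressed": "UNCOMPRESSED",
    "snappy": "SNAPPY",
    "gzip": "GZIP",
    "lzo": "LZO",
}

def resolve_codec(conf, key="spark.sql.parquet.compression.codec",
                  default="gzip"):
    """Look up the user's codec choice (case-insensitive), falling back to a default."""
    name = conf.get(key, default).lower()
    if name not in PARQUET_CODECS:
        raise ValueError(f"unknown parquet codec {name!r}; "
                         f"expected one of {sorted(PARQUET_CODECS)}")
    return PARQUET_CODECS[name]
```

The `default` parameter is where the GZIP-vs-SNAPPY question from the discussion would be decided, in one place.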
[jira] [Commented] (SPARK-3131) Allow user to set parquet compression codec for writing ParquetFile in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102724#comment-14102724 ] Apache Spark commented on SPARK-3131: - User 'chutium' has created a pull request for this issue: https://github.com/apache/spark/pull/2039 Allow user to set parquet compression codec for writing ParquetFile in SQLContext - Key: SPARK-3131 URL: https://issues.apache.org/jira/browse/SPARK-3131 Project: Spark Issue Type: Improvement Components: SQL Reporter: Teng Qiu There are 4 different compression codecs available for ParquetOutputFormat. In Spark SQL it is set as a hard-coded value in {code}ParquetRelation.defaultCompression{code} Original discussion: https://github.com/apache/spark/pull/195#discussion-diff-11002083 So we need to add a new config property in SQLConf to allow the user to change this compression codec, and I used a similar short-names syntax as described in SPARK-2953. BTW, which codec should we use as default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think maybe we should change this to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469), and parquet-mr supports the Snappy codec natively.
[jira] [Commented] (SPARK-3117) Avoid serialization for TorrentBroadcast blocks
[ https://issues.apache.org/jira/browse/SPARK-3117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102758#comment-14102758 ] Reynold Xin commented on SPARK-3117: This is going to be fixed by https://github.com/apache/spark/pull/2030 Avoid serialization for TorrentBroadcast blocks --- Key: SPARK-3117 URL: https://issues.apache.org/jira/browse/SPARK-3117 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin TorrentBroadcast uses a bunch of wrapper objects and MEMORY_AND_DISK storage level to store the torrent blocks. I don't think those are necessary. We can probably get rid of them completely to store everything in serialized form.
[jira] [Created] (SPARK-3132) Avoid serialization for Array[Byte] in TorrentBroadcast
Reynold Xin created SPARK-3132: -- Summary: Avoid serialization for Array[Byte] in TorrentBroadcast Key: SPARK-3132 URL: https://issues.apache.org/jira/browse/SPARK-3132 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin If the input data is a byte array, we should allow TorrentBroadcast to skip serializing and compressing the input. To do this, we should add a new parameter (shortCircuitByteArray) to TorrentBroadcast, and then avoid serialization if the input is a byte array and shortCircuitByteArray is true. We should then also do compression in task serialization itself instead of doing it in TorrentBroadcast.
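The short-circuit described above can be sketched in a few lines. This is an illustrative sketch, not Spark's code (only the parameter name is taken from the issue): when the payload is already a byte array, the serialize-and-compress step is skipped entirely.

```python
import pickle

def to_bytes(value, short_circuit_byte_array=True):
    """Serialize a value for broadcast, passing byte arrays through untouched."""
    if short_circuit_byte_array and isinstance(value, (bytes, bytearray)):
        return bytes(value)       # already bytes: skip serialization entirely
    return pickle.dumps(value)    # general case: serialize as usual
```

Compression would then move to the caller (task serialization in the issue's plan), so it is applied exactly once regardless of which branch is taken.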
[jira] [Created] (SPARK-3133) Piggyback get location RPC call to fetch small blocks
Reynold Xin created SPARK-3133: -- Summary: Piggyback get location RPC call to fetch small blocks Key: SPARK-3133 URL: https://issues.apache.org/jira/browse/SPARK-3133 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. Once we use this, it effectively makes TorrentBroadcast behave similarly to HttpBroadcast.
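The piggybacked lookup can be sketched as a toy master (class, method, and threshold names are hypothetical, not Spark's API): when the requested block is below a size threshold, the response carries the block data itself rather than a location list, saving the follow-up fetch round trip.

```python
SMALL_BLOCK_THRESHOLD = 8 * 1024  # bytes; illustrative cutoff

class BlockMaster:
    def __init__(self):
        self.locations = {}  # block_id -> list of executor ids
        self.blocks = {}     # block_id -> bytes held master-side

    def get_locations_or_data(self, block_id):
        """Return ("data", bytes) for small blocks, else ("locations", [...])."""
        data = self.blocks.get(block_id)
        if data is not None and len(data) <= SMALL_BLOCK_THRESHOLD:
            return ("data", data)                       # piggybacked payload
        return ("locations", self.locations.get(block_id, []))
```

For small broadcasts this collapses the two-round-trip pattern (locate, then fetch) into one, which is the HttpBroadcast-like behavior the issue mentions.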
[jira] [Created] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast
Reynold Xin created SPARK-3134: -- Summary: Update block locations asynchronously in TorrentBroadcast Key: SPARK-3134 URL: https://issues.apache.org/jira/browse/SPARK-3134 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Once TorrentBroadcast gets the data blocks, it needs to tell the master the new location. We should make the location update non-blocking to reduce the round trips needed to launch tasks.
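The non-blocking update can be sketched as a fire-and-forget notification (illustrative only, not Spark's code; the master object and method name are hypothetical): the fetch path returns immediately while the master is notified on a background thread.

```python
from concurrent.futures import ThreadPoolExecutor

# Single background thread keeps updates ordered while keeping callers unblocked.
_notifier = ThreadPoolExecutor(max_workers=1)

def report_block_location(master, block_id, executor_id):
    """Notify the master of a new block location without blocking the caller."""
    return _notifier.submit(master.update_location, block_id, executor_id)
```

The returned future lets a caller that does care (e.g. shutdown code) wait for delivery, while the common path ignores it.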
[jira] [Created] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
Reynold Xin created SPARK-3135: -- Summary: Avoid memory copy in TorrentBroadcast serialization Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin TorrentBroadcast uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
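The chunked output stream the issue proposes can be sketched as follows (illustrative, not Spark's class): bytes are appended directly into fixed-size chunks as they arrive, so the single giant intermediate array, and the copy out of it, never exist.

```python
import io

class ChunkedOutputStream(io.RawIOBase):
    """Write bytes straight into fixed-size chunks, avoiding one big buffer."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def write(self, data):
        view = memoryview(bytes(data))
        written = len(view)
        while view:
            room = self.chunk_size - len(self.chunks[-1])
            if room == 0:
                self.chunks.append(bytearray())  # current chunk full: start a new one
                continue
            self.chunks[-1].extend(view[:room])
            view = view[room:]
        return written
```

A serializer pointed at this stream produces the torrent chunks directly; the final chunk is simply allowed to be short.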
[jira] [Created] (SPARK-3136) create java-friendly methods in RandomRDDs
Xiangrui Meng created SPARK-3136: Summary: create java-friendly methods in RandomRDDs Key: SPARK-3136 URL: https://issues.apache.org/jira/browse/SPARK-3136 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users would expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users.
[jira] [Updated] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3135: --- Description: TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy. (was: TorrentBroadcast uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.) Avoid memory copy in TorrentBroadcast serialization --- Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
[jira] [Updated] (SPARK-3135) Avoid memory copy in TorrentBroadcast serialization
[ https://issues.apache.org/jira/browse/SPARK-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3135: --- Labels: starter (was: ) Avoid memory copy in TorrentBroadcast serialization --- Key: SPARK-3135 URL: https://issues.apache.org/jira/browse/SPARK-3135 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Labels: starter TorrentBroadcast.blockifyObject uses a ByteArrayOutputStream to serialize broadcast object into a single giant byte array, and then separates it into smaller chunks. We should implement a new OutputStream that writes serialized bytes directly into chunks of byte arrays so we don't need the extra memory copy.
[jira] [Updated] (SPARK-3133) Piggyback get location RPC call to fetch small blocks
[ https://issues.apache.org/jira/browse/SPARK-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3133: --- Description: We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast. was: We should add a new API to the BlockManagerMasterActor to get location or the data block directly if the data block is small. Once we use this, this effectively makes TorrentBroadcast behaves similarly to HttpBroadcast. Piggyback get location RPC call to fetch small blocks - Key: SPARK-3133 URL: https://issues.apache.org/jira/browse/SPARK-3133 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast.
[jira] [Resolved] (SPARK-3128) Use streaming test suite for StreamingLR
[ https://issues.apache.org/jira/browse/SPARK-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-3128. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.0 Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Fix For: 1.1.0, 1.2.0 Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend.
[jira] [Updated] (SPARK-3133) Piggyback get location RPC call to fetch small blocks
[ https://issues.apache.org/jira/browse/SPARK-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3133: --- Description: We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast for small blocks. was: We should add a new API to the BlockManagerMasterActor to get location or the data block directly if the data block is small. This effectively makes TorrentBroadcast behaves similarly to HttpBroadcast. Piggyback get location RPC call to fetch small blocks - Key: SPARK-3133 URL: https://issues.apache.org/jira/browse/SPARK-3133 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin We should add a new API to the BlockManagerMasterActor to get the location, or the data block directly if the data block is small. This effectively makes TorrentBroadcast behave similarly to HttpBroadcast for small blocks.
[jira] [Created] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject
Reynold Xin created SPARK-3137: -- Summary: Use finer grained locking in TorrentBroadcast.readObject Key: SPARK-3137 URL: https://issues.apache.org/jira/browse/SPARK-3137 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin TorrentBroadcast.readObject uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks.
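The finer-grained scheme can be sketched as a lock-per-broadcast table (illustrative only; class and method names are hypothetical, not Spark's): tasks fetching different broadcasts take different locks and proceed independently, while tasks racing on the same broadcast still serialize.

```python
import threading
from collections import defaultdict

class PerBroadcastLocks:
    """One lock per broadcast id instead of a single global lock."""

    def __init__(self):
        self._guard = threading.Lock()             # protects the lock table only
        self._locks = defaultdict(threading.Lock)  # broadcast_id -> its lock

    def lock_for(self, broadcast_id):
        with self._guard:  # brief critical section: just the table lookup
            return self._locks[broadcast_id]
```

Usage would be `with locks.lock_for(bid): fetch_blocks(bid)`; the table guard is held only for the dictionary lookup, never across a fetch.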
[jira] [Updated] (SPARK-3137) Use finer grained locking in TorrentBroadcast.readObject
[ https://issues.apache.org/jira/browse/SPARK-3137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3137: --- Component/s: Spark Core Target Version/s: 1.2.0 Use finer grained locking in TorrentBroadcast.readObject Key: SPARK-3137 URL: https://issues.apache.org/jira/browse/SPARK-3137 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin TorrentBroadcast.readObject uses a global lock so only one task can be fetching the blocks at the same time. This is not optimal if we are running multiple stages concurrently because they should be able to independently fetch their own blocks.
[jira] [Commented] (SPARK-3136) create java-friendly methods in RandomRDDs
[ https://issues.apache.org/jira/browse/SPARK-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102791#comment-14102791 ] Apache Spark commented on SPARK-3136: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/2041 create java-friendly methods in RandomRDDs -- Key: SPARK-3136 URL: https://issues.apache.org/jira/browse/SPARK-3136 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Though we don't use default arguments for methods in RandomRDDs, it is still not easy for Java users to use because the output type is either `RDD[Double]` or `RDD[Vector]`. Java users would expect `JavaDoubleRDD` and `JavaRDD[Vector]`, respectively. We should create dedicated methods for Java users, and allow default arguments in Scala methods in RandomRDDs, to make life easier for both Java and Scala users.
[jira] [Resolved] (SPARK-2333) spark_ec2 script should allow option for existing security group
[ https://issues.apache.org/jira/browse/SPARK-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2333. --- Resolution: Fixed spark_ec2 script should allow option for existing security group Key: SPARK-2333 URL: https://issues.apache.org/jira/browse/SPARK-2333 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.0.0 Reporter: Kam Kasravi Priority: Minor Fix For: 1.1.0 spark-ec2 will create a new security group with hardcoded attributes - an option --use_group should be provided to reuse an existing security group
[jira] [Updated] (SPARK-2333) spark_ec2 script should allow option for existing security group
[ https://issues.apache.org/jira/browse/SPARK-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2333: -- Issue Type: Improvement (was: Bug) spark_ec2 script should allow option for existing security group Key: SPARK-2333 URL: https://issues.apache.org/jira/browse/SPARK-2333 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.0.0 Reporter: Kam Kasravi Priority: Minor Fix For: 1.1.0 spark-ec2 will create a new security group with hardcoded attributes - an option --use_group should be provided to reuse an existing security group
[jira] [Updated] (SPARK-2839) Documentation for statistical functions
[ https://issues.apache.org/jira/browse/SPARK-2839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2839: - Assignee: Burak Yavuz Documentation for statistical functions --- Key: SPARK-2839 URL: https://issues.apache.org/jira/browse/SPARK-2839 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Assignee: Burak Yavuz Add documentation and code examples for statistical functions to MLlib's programming guide.
[jira] [Updated] (SPARK-3112) Documentation for Streaming Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-3112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3112: - Assignee: Jeremy Freeman Documentation for Streaming Logistic Regression - Key: SPARK-3112 URL: https://issues.apache.org/jira/browse/SPARK-3112 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Ameet Talwalkar Assignee: Jeremy Freeman
[jira] [Resolved] (SPARK-2790) PySpark zip() doesn't work properly if RDDs have different serializers
[ https://issues.apache.org/jira/browse/SPARK-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2790. --- Resolution: Fixed Fix Version/s: 1.1.0 PySpark zip() doesn't work properly if RDDs have different serializers -- Key: SPARK-2790 URL: https://issues.apache.org/jira/browse/SPARK-2790 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.1.0 Reporter: Josh Rosen Assignee: Davies Liu Priority: Critical Fix For: 1.1.0 In PySpark, attempting to {{zip()}} two RDDs may fail if the RDDs have different serializers (e.g. batched vs. unbatched), even if those RDDs have the same number of partitions and same numbers of elements. This problem occurs in the MLlib Python APIs, where we might want to zip a JavaRDD of LabelledPoints with a JavaRDD of batch-serialized Python objects. This is problematic because whether zip() succeeds or errors depends on the partitioning / batching strategy, and we don't want to surface the serialization details to users.
[jira] [Created] (SPARK-3138) sqlContext.parquetFile should be able to take a single file as parameter
Teng Qiu created SPARK-3138: --- Summary: sqlContext.parquetFile should be able to take a single file as parameter Key: SPARK-3138 URL: https://issues.apache.org/jira/browse/SPARK-3138 Project: Spark Issue Type: Bug Components: SQL Reporter: Teng Qiu http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-parquetFile-path-fails-if-path-is-a-file-but-succeeds-if-a-directory-tp12345.html to reproduce this issue in spark-shell:
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
import org.apache.hadoop.fs.{FileSystem, Path}

case class TestRDDEntry(key: Int, value: String)

val path = "/tmp/parquet_test"
sc.parallelize(1 to 100).map(i => TestRDDEntry(i, s"val_$i")).coalesce(1).saveAsParquetFile(path)

val fsPath = new Path(path)
val fs: FileSystem = fsPath.getFileSystem(sc.hadoopConfiguration)
val children = fs.listStatus(fsPath).filter(_.getPath.getName.endsWith(".parquet"))
val readFile = sqlContext.parquetFile(path + "/" + children(0).getPath.getName)
{code}
it throws an exception:
{code}
java.lang.IllegalArgumentException: Expected file:/tmp/parquet_test/part-r-1.parquet for be a directory with Parquet files/metadata
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:374)
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:414)
	at org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:66)
{code}
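One way the requested behaviour could work is to branch on whether the input path is a directory before deciding where to read Parquet metadata from, instead of rejecting plain files outright. This is a hypothetical sketch in plain Java using java.nio.file as a stand-in for Hadoop's FileSystem; the class and method names are illustrative, not Spark's actual ParquetTypesConverter code.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the proposed fix: accept either a single Parquet
// file or a directory of Parquet files.
public class ParquetPathResolver {
    static String resolveMetadataPath(Path p) {
        if (Files.isDirectory(p)) {
            // directory: read the summary metadata file inside it
            return p.resolve("_metadata").toString();
        }
        // single file: read the footer of the file itself
        return p.toString();
    }
}
```

With this check in place, `parquetFile("/tmp/parquet_test")` and `parquetFile("/tmp/parquet_test/part-r-1.parquet")` would both resolve to a readable metadata source rather than throwing IllegalArgumentException for the file case.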
[jira] [Created] (SPARK-3139) Akka timeouts from ContextCleaner when cleaning shuffles
Josh Rosen created SPARK-3139: - Summary: Akka timeouts from ContextCleaner when cleaning shuffles Key: SPARK-3139 URL: https://issues.apache.org/jira/browse/SPARK-3139 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Environment: 10 r3.2xlarge tests on EC2, running the scala-agg-by-key-int spark-perf test against master commit d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb. Reporter: Josh Rosen Priority: Blocker When running spark-perf tests on EC2, I have a job that's consistently logging the following Akka exceptions: {code} 14/08/19 22:07:12 ERROR spark.ContextCleaner: Error cleaning shuffle 0 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:118) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:159) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:131) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:124) at scala.Option.foreach(Option.scala:236) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:124) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1252) at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:119) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) {code} and {code} 14/08/19 22:07:12 ERROR storage.BlockManagerMaster: Failed to remove shuffle 0 akka.pattern.AskTimeoutException: Timed out at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455) at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407) at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411) at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363) at java.lang.Thread.run(Thread.java:745) {code} This doesn't seem to prevent the job from completing successfully, but it's a serious issue because it means that resources aren't being cleaned up. The test script, ScalaAggByKeyInt, runs each test 10 times, and I see the same error after each test, so this seems deterministically reproducible. I'll look at the executor logs to see if I can find more info there.
[jira] [Commented] (SPARK-3138) sqlContext.parquetFile should be able to take a single file as parameter
[ https://issues.apache.org/jira/browse/SPARK-3138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102961#comment-14102961 ] Teng Qiu commented on SPARK-3138: - Be careful if someone is working on SPARK-2551; make sure the new change passes the test case {code}test("Read a parquet file instead of a directory"){code} sqlContext.parquetFile should be able to take a single file as parameter Key: SPARK-3138 URL: https://issues.apache.org/jira/browse/SPARK-3138 Project: Spark Issue Type: Bug Components: SQL Reporter: Teng Qiu http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-parquetFile-path-fails-if-path-is-a-file-but-succeeds-if-a-directory-tp12345.html to reproduce this issue in spark-shell:
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
import org.apache.hadoop.fs.{FileSystem, Path}

case class TestRDDEntry(key: Int, value: String)

val path = "/tmp/parquet_test"
sc.parallelize(1 to 100).map(i => TestRDDEntry(i, s"val_$i")).coalesce(1).saveAsParquetFile(path)

val fsPath = new Path(path)
val fs: FileSystem = fsPath.getFileSystem(sc.hadoopConfiguration)
val children = fs.listStatus(fsPath).filter(_.getPath.getName.endsWith(".parquet"))
val readFile = sqlContext.parquetFile(path + "/" + children(0).getPath.getName)
{code}
it throws an exception:
{code}
java.lang.IllegalArgumentException: Expected file:/tmp/parquet_test/part-r-1.parquet for be a directory with Parquet files/metadata
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.scala:374)
	at org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:414)
	at org.apache.spark.sql.parquet.ParquetRelation.init(ParquetRelation.scala:66)
{code}
[jira] [Commented] (SPARK-3139) Akka timeouts from ContextCleaner when cleaning shuffles
[ https://issues.apache.org/jira/browse/SPARK-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14102960#comment-14102960 ] Josh Rosen commented on SPARK-3139: --- I used pssh + grep to search through the application logs on the workers and I couldn't find any ERRORs or Exceptions (I'm sure that I was searching the right log directories, since other searches return matches). Akka timeouts from ContextCleaner when cleaning shuffles Key: SPARK-3139 URL: https://issues.apache.org/jira/browse/SPARK-3139 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Environment: 10 r3.2xlarge tests on EC2, running the scala-agg-by-key-int spark-perf test against master commit d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb. Reporter: Josh Rosen Priority: Blocker When running spark-perf tests on EC2, I have a job that's consistently logging the following Akka exceptions: {code} 14/08/19 22:07:12 ERROR spark.ContextCleaner: Error cleaning shuffle 0 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.storage.BlockManagerMaster.removeShuffle(BlockManagerMaster.scala:118) at org.apache.spark.ContextCleaner.doCleanupShuffle(ContextCleaner.scala:159) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:131) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:124) at scala.Option.foreach(Option.scala:236) at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:124) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:120) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1252) at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:119) at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65) {code} and {code} 14/08/19 22:07:12 ERROR storage.BlockManagerMaster: Failed to remove shuffle 0 akka.pattern.AskTimeoutException: Timed out at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455) at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407) at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411) at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363) at java.lang.Thread.run(Thread.java:745) {code} This doesn't seem to prevent the job from completing successfully, but it's a serious issue because it means that resources aren't being cleaned up. The test script, ScalaAggByKeyInt, runs each test 10 times, and I see the same error after each test, so this seems deterministically reproducible. I'll look at the executor logs to see if I can find more info there. 
[jira] [Created] (SPARK-3140) PySpark start-up throws confusing exception
Andrew Or created SPARK-3140: Summary: PySpark start-up throws confusing exception Key: SPARK-3140 URL: https://issues.apache.org/jira/browse/SPARK-3140 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2 Reporter: Andrew Or Priority: Critical Currently we read the pyspark port through stdout of the spark-submit subprocess. However, if there is stdout interference, e.g. spark-submit echoes something unexpected to stdout, we print the following: {code} Exception: Launching GatewayServer failed! (Warning: unexpected output detected.) {code} This condition is fine. However, we actually throw the same exception if there is *no* output from the subprocess as well. This is very confusing because it implies that the subprocess is outputting something (possibly whitespace, which is not visible) when it's actually not.
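The fix the issue implies is to distinguish the two failure modes when parsing the gateway port from the subprocess's stdout: "no output at all" versus "output that isn't a port number". A hypothetical sketch of that distinction follows; PySpark's real launcher is Python, and the class, method, and message strings below are illustrative, not the actual code.

```java
// Hypothetical sketch (not PySpark's actual launcher): report empty stdout
// and unexpected stdout as two different, unambiguous errors instead of one
// confusing shared exception message.
public class GatewayPortParser {
    static String diagnose(String stdoutLine) {
        if (stdoutLine == null || stdoutLine.trim().isEmpty()) {
            // previously this case reused the "unexpected output" message
            return "Launching GatewayServer failed! (no output from subprocess)";
        }
        try {
            Integer.parseInt(stdoutLine.trim()); // a bare port number is success
            return "ok";
        } catch (NumberFormatException e) {
            return "Launching GatewayServer failed! (unexpected output: " + stdoutLine + ")";
        }
    }
}
```

With separate messages, a user seeing the "no output" variant knows the subprocess died or printed nothing, rather than suspecting invisible whitespace on stdout.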