[jira] [Created] (SPARK-3611) Show number of cores for each executor in application web UI

2014-09-20 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3611:


 Summary: Show number of cores for each executor in application web 
UI
 Key: SPARK-3611
 URL: https://issues.apache.org/jira/browse/SPARK-3611
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Matei Zaharia
Priority: Minor


This number is not always fully known, because e.g. in Mesos an executor can 
scale up and down in its number of CPUs, but it would be nice to show at least 
the number of cores the machine has in that case, or the number of cores the 
executor has been configured with, if known.






[jira] [Created] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver

2014-09-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3612:
--

 Summary: Executor shouldn't quit if heartbeat message fails to 
reach the driver
 Key: SPARK-3612
 URL: https://issues.apache.org/jira/browse/SPARK-3612
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin


The thread started by Executor.startDriverHeartbeater can actually terminate 
the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an 
exception. 

I don't think we should quit the executor this way. At the very least, we would 
want to log a more meaningful exception than simply
{code}
14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Driver Heartbeater,5,main]
org.apache.spark.SparkException: Error sending message [message = 
Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, 
ip-172-31-7-55.eu-west-1.compute.internal, 52303))]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
... 1 more

{code}
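
For illustration, here is a minimal sketch of the direction suggested in this issue: catch and log heartbeat failures inside the heartbeater loop instead of letting the exception escape and terminate the executor. The names {{sendHeartbeat}} and {{intervalMs}} are placeholders, not Spark's actual API; note also that a ScheduledExecutorService cancels a periodic task whose run() throws, which is another reason to catch here.

{code}
// Sketch only: log heartbeat failures and retry on the next interval rather than dying.
import java.util.concurrent.{Executors, TimeUnit}

object HeartbeaterSketch {
  def start(sendHeartbeat: () => Unit, intervalMs: Long = 10000L): Unit = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit = {
        try {
          sendHeartbeat() // may throw, e.g. a TimeoutException as in the log above
        } catch {
          case e: Exception =>
            // A missed heartbeat should not stop the executor; log and keep going.
            System.err.println(s"Heartbeat failed, will retry in ${intervalMs}ms: $e")
        }
      }
    }
    scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}
{code}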






[jira] [Commented] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver

2014-09-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141807#comment-14141807
 ] 

Reynold Xin commented on SPARK-3612:


[~andrewor14] [~sandyryza] any comment on this? I think you guys worked on this 
code.

 Executor shouldn't quit if heartbeat message fails to reach the driver
 --

 Key: SPARK-3612
 URL: https://issues.apache.org/jira/browse/SPARK-3612
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin

 The thread started by Executor.startDriverHeartbeater can actually terminate 
 the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an 
 exception. 
 I don't think we should quit the executor this way. At the very least, we 
 would want to log a more meaningful exception than simply
 {code}
 14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Driver Heartbeater,5,main]
 org.apache.spark.SparkException: Error sending message [message = 
 Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, 
 ip-172-31-7-55.eu-west-1.compute.internal, 52303))]
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 ... 1 more
 {code}






[jira] [Created] (SPARK-3613) Don't record the size of each shuffle block for large jobs

2014-09-20 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3613:
--

 Summary: Don't record the size of each shuffle block for large jobs
 Key: SPARK-3613
 URL: https://issues.apache.org/jira/browse/SPARK-3613
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


MapStatus saves the size of each block (1 byte per block) for a particular map 
task. This actually means the shuffle metadata is O(M*R), where M = num maps 
and R = num reduces.

If M is greater than a certain size, we should probably just send an average 
size instead of a whole array.
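
As a rough illustration of the proposed trade-off (not the eventual implementation), a MapStatus-like structure could switch to a single average size once the number of reduce partitions crosses some threshold; the threshold value below is purely hypothetical.

{code}
// Sketch: exact per-block sizes for small jobs, one averaged size for large ones.
sealed trait MapOutputSizes {
  def sizeFor(reduceId: Int): Long
}

final case class ExactSizes(sizes: Array[Long]) extends MapOutputSizes {
  def sizeFor(reduceId: Int): Long = sizes(reduceId)
}

final case class AveragedSizes(numBlocks: Int, avgSize: Long) extends MapOutputSizes {
  def sizeFor(reduceId: Int): Long = avgSize // every block reported at the average size
}

object MapOutputSizes {
  val Threshold = 2000 // hypothetical cutoff; the issue leaves the exact value open

  def apply(sizes: Array[Long]): MapOutputSizes =
    if (sizes.length <= Threshold) {
      ExactSizes(sizes)
    } else {
      val nonEmpty = sizes.count(_ > 0)
      val avg = if (nonEmpty == 0) 0L else sizes.sum / nonEmpty
      AveragedSizes(sizes.length, avg)
    }
}
{code}

This keeps a size estimate available on the reduce side while making the per-map-task metadata O(1) instead of O(R).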









[jira] [Commented] (SPARK-3613) Don't record the size of each shuffle block for large jobs

2014-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141870#comment-14141870
 ] 

Apache Spark commented on SPARK-3613:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2470

 Don't record the size of each shuffle block for large jobs
 --

 Key: SPARK-3613
 URL: https://issues.apache.org/jira/browse/SPARK-3613
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin

 MapStatus saves the size of each block (1 byte per block) for a particular 
 map task. This actually means the shuffle metadata is O(M*R), where M = num 
 maps and R = num reduces.
 If M is greater than a certain size, we should probably just send an average 
 size instead of a whole array.






[jira] [Updated] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.

2014-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3608:
---
Assignee: Vida Ha

 Spark EC2 Script does not correctly break when AWS tagging succeeds.
 

 Key: SPARK-3608
 URL: https://issues.apache.org/jira/browse/SPARK-3608
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Vida Ha
Assignee: Vida Ha
Priority: Critical
 Fix For: 1.2.0


 The Spark EC2 script will tag five times and does not break out of the retry 
 loop correctly when tagging succeeds.






[jira] [Resolved] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.

2014-09-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3608.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0

 Spark EC2 Script does not correctly break when AWS tagging succeeds.
 

 Key: SPARK-3608
 URL: https://issues.apache.org/jira/browse/SPARK-3608
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.1.0
Reporter: Vida Ha
Assignee: Vida Ha
Priority: Critical
 Fix For: 1.2.0


 The Spark EC2 script will tag five times and does not break out of the retry 
 loop correctly when tagging succeeds.






[jira] [Commented] (SPARK-3562) Periodic cleanup event logs

2014-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141876#comment-14141876
 ] 

Apache Spark commented on SPARK-3562:
-

User 'viper-kun' has created a pull request for this issue:
https://github.com/apache/spark/pull/2471

 Periodic cleanup event logs
 ---

 Key: SPARK-3562
 URL: https://issues.apache.org/jira/browse/SPARK-3562
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: xukun

 If we run Spark applications frequently, they will write many event logs into 
 spark.eventLog.dir. After a while, there will be many event logs in 
 spark.eventLog.dir that we no longer care about. Periodic cleanups would ensure 
 that logs older than a configured duration are forgotten, so there is no need 
 to clean up the logs by hand.
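
A minimal sketch of such a cleaner for a local event log directory is below; the retention and interval values are illustrative assumptions, not existing Spark configuration keys, and a real implementation would also need to handle HDFS paths and in-progress logs.

{code}
// Sketch: periodically delete event logs older than a retention window.
import java.io.File
import java.util.concurrent.{Executors, TimeUnit}

object EventLogCleanerSketch {
  def schedule(eventLogDir: File,
               maxAgeMs: Long = 7L * 24 * 60 * 60 * 1000,        // keep one week of logs
               intervalMs: Long = 24L * 60 * 60 * 1000): Unit = { // run once a day
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      override def run(): Unit = {
        val cutoff = System.currentTimeMillis() - maxAgeMs
        Option(eventLogDir.listFiles()).getOrElse(Array.empty[File])
          .filter(_.lastModified() < cutoff) // older than the retention window
          .foreach { f =>
            if (!f.delete()) System.err.println(s"Could not delete ${f.getPath}")
          }
      }
    }
    scheduler.scheduleAtFixedRate(task, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}
{code}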






[jira] [Commented] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver

2014-09-20 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142006#comment-14142006
 ] 

Sandy Ryza commented on SPARK-3612:
---

Yeah, we should catch this.  Will post a patch.

 Executor shouldn't quit if heartbeat message fails to reach the driver
 --

 Key: SPARK-3612
 URL: https://issues.apache.org/jira/browse/SPARK-3612
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin

 The thread started by Executor.startDriverHeartbeater can actually terminate 
 the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an 
 exception. 
 I don't think we should quit the executor this way. At the very least, we 
 would want to log a more meaningful exception than simply
 {code}
 14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception 
 in thread Thread[Driver Heartbeater,5,main]
 org.apache.spark.SparkException: Error sending message [message = 
 Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, 
 ip-172-31-7-55.eu-west-1.compute.internal, 52303))]
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190)
 at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 
 seconds]
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176)
 ... 1 more
 {code}






[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-20 Thread Jatinpreet Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jatinpreet Singh updated SPARK-3614:

Description: 
The IDF class in MLlib does not provide the capability of defining a minimum 
number of documents a term should appear in the corpus. The idea is to have a 
cutoff variable which defines this minimum occurrence value, and the terms 
which have lower frequency are ignored.

Mathematically,
IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) >= minimumOccurance


where, 
D is the total number of documents in the corpus
DF(t,D) is the number of documents that contains term t
minimumOccurance is the minimum number of documents the term appears in the 
document corpus

This would have an impact on accuracy as terms that appear in less than a 
certain limit of documents, have low or no importance in TFIDF vectors.

  was:
The IDF class in MLlib does not provide the capability of defining a minimum 
number of documents a term should appear in the corpus. The idea is to have a 
cutoff variable which defines this minimum occurrence value, and the terms 
which have lower frequency are ignored.

Mathematically,
IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) >= minimumOccurance


where, 
|D| is the total number of documents in the corpus
DF(t,D) is the number of documents that contains term t
minimumOccurance is the minimum number of documents the term appears in the 
document corpus

This would have an impact on accuracy as terms that appear in less than a 
certain limit of documents, have low or no importance in TFIDF vectors.


 Filter on minimum occurrences of a term in IDF 
 ---

 Key: SPARK-3614
 URL: https://issues.apache.org/jira/browse/SPARK-3614
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jatinpreet Singh
Priority: Minor
  Labels: TFIDF

 The IDF class in MLlib does not provide the capability of defining a minimum 
 number of documents a term should appear in the corpus. The idea is to have a 
 cutoff variable which defines this minimum occurrence value, and the terms 
 which have lower frequency are ignored.
 Mathematically,
 IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) >= minimumOccurance
 where, 
 D is the total number of documents in the corpus
 DF(t,D) is the number of documents that contains term t
 minimumOccurance is the minimum number of documents the term appears in the 
 document corpus
 This would have an impact on accuracy as terms that appear in less than a 
 certain limit of documents, have low or no importance in TFIDF vectors.






[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-20 Thread Jatinpreet Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jatinpreet Singh updated SPARK-3614:

Description: 
The IDF class in MLlib does not provide the capability of defining a minimum 
number of documents a term should appear in the corpus. The idea is to have a 
cutoff variable which defines this minimum occurrence value, and the terms 
which have lower frequency are ignored.

Mathematically,
IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance


where, 
D is the total number of documents in the corpus
DF(t,D) is the number of documents that contains term t
minimumOccurance is the minimum number of documents the term appears in the 
document corpus

This would have an impact on accuracy as terms that appear in less than a 
certain limit of documents, have low or no importance in TFIDF vectors.

  was:
The IDF class in MLlib does not provide the capability of defining a minimum 
number of documents a term should appear in the corpus. The idea is to have a 
cutoff variable which defines this minimum occurrence value, and the terms 
which have lower frequency are ignored.

Mathematically,
IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) >= minimumOccurance


where, 
D is the total number of documents in the corpus
DF(t,D) is the number of documents that contains term t
minimumOccurance is the minimum number of documents the term appears in the 
document corpus

This would have an impact on accuracy as terms that appear in less than a 
certain limit of documents, have low or no importance in TFIDF vectors.


 Filter on minimum occurrences of a term in IDF 
 ---

 Key: SPARK-3614
 URL: https://issues.apache.org/jira/browse/SPARK-3614
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jatinpreet Singh
Priority: Minor
  Labels: TFIDF

 The IDF class in MLlib does not provide the capability of defining a minimum 
 number of documents a term should appear in the corpus. The idea is to have a 
 cutoff variable which defines this minimum occurrence value, and the terms 
 which have lower frequency are ignored.
 Mathematically,
 IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance
 where, 
 D is the total number of documents in the corpus
 DF(t,D) is the number of documents that contains term t
 minimumOccurance is the minimum number of documents the term appears in the 
 document corpus
 This would have an impact on accuracy as terms that appear in less than a 
 certain limit of documents, have low or no importance in TFIDF vectors.






[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF

2014-09-20 Thread Jatinpreet Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jatinpreet Singh updated SPARK-3614:

Description: 
The IDF class in MLlib does not provide the capability of defining a minimum 
number of documents a term should appear in the corpus. The idea is to have a 
cutoff variable which defines this minimum occurrence value, and the terms 
which have lower frequency are ignored.

Mathematically,
IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance


where, 
D is the total number of documents in the corpus
DF(t,D) is the number of documents that contain the term t
minimumOccurance is the minimum number of documents the term appears in the 
document corpus

This would have an impact on accuracy as terms that appear in less than a 
certain limit of documents, have low or no importance in TFIDF vectors.

  was:
The IDF class in MLlib does not provide the capability of defining a minimum 
number of documents a term should appear in the corpus. The idea is to have a 
cutoff variable which defines this minimum occurrence value, and the terms 
which have lower frequency are ignored.

Mathematically,
IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance


where, 
D is the total number of documents in the corpus
DF(t,D) is the number of documents that contains term t
minimumOccurance is the minimum number of documents the term appears in the 
document corpus

This would have an impact on accuracy as terms that appear in less than a 
certain limit of documents, have low or no importance in TFIDF vectors.


 Filter on minimum occurrences of a term in IDF 
 ---

 Key: SPARK-3614
 URL: https://issues.apache.org/jira/browse/SPARK-3614
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jatinpreet Singh
Priority: Minor
  Labels: TFIDF

 The IDF class in MLlib does not provide the capability of defining a minimum 
 number of documents a term should appear in the corpus. The idea is to have a 
 cutoff variable which defines this minimum occurrence value, and the terms 
 which have lower frequency are ignored.
 Mathematically,
 IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance
 where, 
 D is the total number of documents in the corpus
 DF(t,D) is the number of documents that contain the term t
 minimumOccurance is the minimum number of documents the term appears in the 
 document corpus
 This would have an impact on accuracy as terms that appear in less than a 
 certain limit of documents, have low or no importance in TFIDF vectors.
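
A minimal sketch of the cutoff described above, using the corrected formula IDF(t,D) = log((|D|+1)/(DF(t,D)+1)) and assigning zero weight to terms below minimumOccurance; it uses plain Scala collections for clarity rather than MLlib's RDD-based IDF API.

{code}
// Sketch: IDF with a minimum document-frequency cutoff.
object IdfWithCutoff {
  /** docs: each document is represented by the set of distinct terms it contains. */
  def idf(docs: Seq[Set[String]], minimumOccurance: Int): Map[String, Double] = {
    val numDocs = docs.size
    val docFreq: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (term, occs) => term -> occs.size }
    docFreq.map { case (term, df) =>
      val weight =
        if (df >= minimumOccurance) math.log((numDocs + 1.0) / (df + 1.0))
        else 0.0 // ignored: the term appears in too few documents
      term -> weight
    }
  }
}
{code}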






[jira] [Created] (SPARK-3615) Kafka test should not hard code Zookeeper port

2014-09-20 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3615:
--

 Summary: Kafka test should not hard code Zookeeper port
 Key: SPARK-3615
 URL: https://issues.apache.org/jira/browse/SPARK-3615
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Patrick Wendell
Assignee: Saisai Shao


This is causing failures in our master build if port 2181 is contended. Instead 
of binding to a static port, we should refactor this so that it opens a socket 
on port 0 and then reads back the assigned port, so we can never have contention.
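
A small sketch of the port-0 trick is below; {{startZookeeper}} is a placeholder for whatever the Kafka test harness actually calls, and there is still a small window between closing the probe socket and the service binding the port.

{code}
// Sketch: let the OS pick a free ephemeral port and read it back.
import java.net.ServerSocket

object EphemeralPortSketch {
  def findFreePort(): Int = {
    val socket = new ServerSocket(0) // port 0 = the OS chooses a free port
    try socket.getLocalPort
    finally socket.close()
  }

  def main(args: Array[String]): Unit = {
    val zkPort = findFreePort()
    println(s"Starting test ZooKeeper on port $zkPort instead of hard-coded 2181")
    // startZookeeper(zkPort)  // placeholder: wire the chosen port into the test setup
  }
}
{code}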






[jira] [Updated] (SPARK-3610) Unable to load app logs for MLLib programs in history server

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Priority: Critical  (was: Major)

 Unable to load app logs for MLLib programs in history server 
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical
 Fix For: 1.1.0


 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?






[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Description: 
Right now we don't have a 

Original bug report:
The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loads up 
correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?

  was:
The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loads up 
correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?


 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical
 Fix For: 1.1.0


 Right now we don't have a 
 Original bug report:
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?






[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Description: 
Right now we use the user-defined application name when creating the logging 
file for the history server. We should use some type of GUID generated from 
inside of Spark instead of allowing user input here. It can cause errors if 
users provide characters that are not valid in filesystem paths.

Original bug report:
{quote}
The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loads up 
correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?
{quote}

  was:
Right now we use the user-defined application name when creating the logging 
file for the history server. We should use some type of GUID generated from 
inside of Spark instead of allowing user input here. It can cause errors if 
users provide characters that are not valid in filesystem paths.

Original bug report:
The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loads up 
correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?


 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical
 Fix For: 1.1.0


 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}
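
For illustration only (the names here are hypothetical, not the eventual fix), the log name could combine a sanitized application name with a generated id so that no user-supplied character ever reaches the filesystem path:

{code}
// Sketch: build an event log name that is always safe as a filesystem path.
import java.util.UUID

object LogNameSketch {
  def eventLogName(appName: String): String = {
    val sanitized = appName.toLowerCase.replaceAll("[^a-z0-9_-]", "-")
    s"$sanitized-${UUID.randomUUID()}"
  }
}

// e.g. eventLogName("binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)")
// yields something like "binaryclassifier-with-params-input-txt-...-<random uuid>"
{code}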






[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Description: 
Right now we use the user-defined application name when creating the logging 
file for the history server. We should use some type of GUID generated from 
inside of Spark instead of allowing user input here. It can cause errors if 
users provide characters that are not valid in filesystem paths.

Original bug report:
The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loads up 
correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?

  was:
Right now we don't have a 

Original bug report:
The default log files for the MLlib examples use a rather long naming 
convention that includes special characters like parentheses and commas. For 
example, one of my log files is named 
binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.

When I click on the program on the history server page (at port 18080) to view 
the detailed application logs, the history server crashes and I need to restart 
it. I am using Spark 1.1 on a Mesos cluster.

I renamed the log file by removing the special characters and then it loads up 
correctly. I am not sure which program is creating the log files. Can it be 
changed so that the default log file naming convention does not include 
special characters?


 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical
 Fix For: 1.1.0


 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?






[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3610:
--
Component/s: Web UI

 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical

 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}






[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
Fix Version/s: (was: 1.1.0)

 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical

 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}






[jira] [Updated] (SPARK-3610) History server log name should not be based on user input

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3610:
---
 Component/s: (was: Web UI)
Target Version/s: 1.2.0

 History server log name should not be based on user input
 -

 Key: SPARK-3610
 URL: https://issues.apache.org/jira/browse/SPARK-3610
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: SK
Priority: Critical

 Right now we use the user-defined application name when creating the logging 
 file for the history server. We should use some type of GUID generated from 
 inside of Spark instead of allowing user input here. It can cause errors if 
 users provide characters that are not valid in filesystem paths.
 Original bug report:
 {quote}
 The default log files for the MLlib examples use a rather long naming 
 convention that includes special characters like parentheses and commas. For 
 example, one of my log files is named 
 binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032.
 When I click on the program on the history server page (at port 18080) to 
 view the detailed application logs, the history server crashes and I need to 
 restart it. I am using Spark 1.1 on a Mesos cluster.
 I renamed the log file by removing the special characters and then it loads 
 up correctly. I am not sure which program is creating the log files. Can it 
 be changed so that the default log file naming convention does not include 
 special characters?
 {quote}






[jira] [Resolved] (SPARK-3609) Add sizeInBytes statistics to Limit operator

2014-09-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3609.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2468
[https://github.com/apache/spark/pull/2468]

 Add sizeInBytes statistics to Limit operator
 

 Key: SPARK-3609
 URL: https://issues.apache.org/jira/browse/SPARK-3609
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Cheng Lian
Assignee: Cheng Lian
 Fix For: 1.2.0


 The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated fairly 
 precisely when all output attributes are of native data types, since all native 
 data types except {{StringType}} have a fixed size. For {{StringType}}, we can 
 use a relatively large (say 4K) default size.
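
A sketch of the arithmetic (with simplified stand-in types, not Catalyst's DataType classes) might look like this, where the string default of 4K is the value suggested above:

{code}
// Sketch: sizeInBytes(LIMIT n) = n * (sum of per-column default sizes).
sealed trait ColType { def defaultSize: Long }
case object IntType     extends ColType { val defaultSize = 4L }
case object LongType    extends ColType { val defaultSize = 8L }
case object DoubleType  extends ColType { val defaultSize = 8L }
case object BooleanType extends ColType { val defaultSize = 1L }
case object StringType  extends ColType { val defaultSize = 4096L } // assumed 4K default

object LimitSizeEstimate {
  def sizeInBytes(limit: Int, schema: Seq[ColType]): BigInt =
    BigInt(limit) * BigInt(schema.map(_.defaultSize).sum)
}

// e.g. sizeInBytes(100, Seq(IntType, StringType, DoubleType)) = 100 * (4 + 4096 + 8)
{code}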






[jira] [Created] (SPARK-3617) Configurable case sensitivity

2014-09-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3617:
---

 Summary: Configurable case sensitivity
 Key: SPARK-3617
 URL: https://issues.apache.org/jira/browse/SPARK-3617
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust


Right now SQLContext is case sensitive and HiveContext is not.  It would be 
better to make it configurable in both instances.  All of the underlying 
plumbing is there; we just need to expose an option.
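
A toy sketch of what the exposed option would control is below; the flag and resolver here are illustrative, not Catalyst's actual analyzer classes.

{code}
// Sketch: column-name resolution whose case sensitivity is driven by one flag.
class ResolverSketch(caseSensitive: Boolean) {
  private def matches(a: String, b: String): Boolean =
    if (caseSensitive) a == b else a.equalsIgnoreCase(b)

  /** Resolve a referenced column name against the available attribute names. */
  def resolve(name: String, attributes: Seq[String]): Option[String] =
    attributes.find(matches(name, _))
}

// new ResolverSketch(caseSensitive = false).resolve("NAME", Seq("name", "age"))
//   => Some("name")
// new ResolverSketch(caseSensitive = true).resolve("NAME", Seq("name", "age"))
//   => None
{code}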






[jira] [Created] (SPARK-3618) Store analyzed plans for temp tables

2014-09-20 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-3618:
---

 Summary: Store analyzed plans for temp tables
 Key: SPARK-3618
 URL: https://issues.apache.org/jira/browse/SPARK-3618
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust


Right now we store unanalyzed logical plans for temporary tables.  However, this 
means that changes to session state (e.g., the current database) could result 
in tables becoming inaccessible.
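
A toy sketch of the difference (with stand-in types, not Catalyst's classes): registering the analyzed plan binds the table to the session state that was current at registration time, whereas the raw plan gets re-resolved later and can break if, say, the current database changes.

{code}
// Sketch: store the analyzed plan at registration time instead of the raw plan.
sealed trait Plan
case class Unresolved(tableName: String) extends Plan
case class Resolved(database: String, tableName: String) extends Plan

class CatalogSketch(var currentDatabase: String) {
  private val tempTables = scala.collection.mutable.Map[String, Plan]()

  private def analyze(plan: Plan): Plan = plan match {
    case Unresolved(t) => Resolved(currentDatabase, t) // bind to the current database now
    case resolved      => resolved
  }

  /** Store the analyzed plan, as this issue proposes. */
  def registerTempTable(name: String, plan: Plan): Unit =
    tempTables(name) = analyze(plan)

  def lookup(name: String): Option[Plan] = tempTables.get(name)
}
{code}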






[jira] [Resolved] (SPARK-2274) spark SQL query hang up sometimes

2014-09-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2274.
-
   Resolution: Fixed
Fix Version/s: 1.1.0

The attached query is using a left outer join, which until 1.1 was executed 
very slowly.  Please reopen this issue if problems persist on a newer version 
of Spark.

 spark SQL query hang up sometimes
 -

 Key: SPARK-2274
 URL: https://issues.apache.org/jira/browse/SPARK-2274
 Project: Spark
  Issue Type: Question
  Components: SQL
 Environment: spark 1.0.0
Reporter: jackielihf
Assignee: Michael Armbrust
 Fix For: 1.1.0


 When I run a Spark SQL query, it hangs up sometimes:
 1) A simple SQL query works, such as select * from a left outer join b on 
 a.id=b.id.
 2) BUT if it has more joins, such as select * from a left outer join b on 
 a.id=b.id left outer join c on a.id=c.id..., the spark shell seems to hang up.
 spark shell prints:
 scala> hc.hql("select 
 A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code
  as 
 type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht
  from txt_nws_bas_update A left outer join txt_nws_bas_txt B on 
 A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer 
 join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on 
 D.secu_id=E.secu_id where D.secu_id is not null limit 5").foreach(println)
 14/06/25 13:32:25 INFO ParseDriver: Parsing command: select 
 A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code
  as 
 type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht
  from txt_nws_bas_update A left outer join txt_nws_bas_txt B on 
 A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer 
 join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on 
 D.secu_id=E.secu_id where D.secu_id is not null limit 5
 14/06/25 13:32:25 INFO ParseDriver: Parse Completed
 14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch 
 MultiInstanceRelations
 14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch 
 CaseInsensitiveAttributeReferences
 14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with 
 curMem=0, maxMem=311387750
 14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to 
 memory (estimated size 215.7 KB, free 296.8 MB)
 14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with 
 curMem=220923, maxMem=311387750
 14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to 
 memory (estimated size 215.8 KB, free 296.5 MB)
 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with 
 curMem=441894, maxMem=311387750
 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to 
 memory (estimated size 215.8 KB, free 296.3 MB)
 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with 
 curMem=662865, maxMem=311387750
 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to 
 memory (estimated size 215.8 KB, free 296.1 MB)
 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with 
 curMem=883836, maxMem=311387750
 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to 
 memory (estimated size 215.8 KB, free 295.9 MB)
 14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for 
 batch Add exchange
 14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for 
 batch Prepare Expressions
 14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1
 14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184
 14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) 
 with 2 output partitions (allowLocal=false)
 14/06/25 13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at 
 joins.scala:184)
 14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List()
 14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List()
 14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map 
 at joins.scala:184), which has no missing parents
 14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
 (MappedRDD[7] at map at joins.scala:184)
 14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
 14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
 executor 0: 192.168.56.100 (PROCESS_LOCAL)
 14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in 
 3 ms
 14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID 

[jira] [Updated] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2014-09-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3267:

Assignee: (was: Michael Armbrust)

 Deadlock between ScalaReflectionLock and Data type initialization
 -

 Key: SPARK-3267
 URL: https://issues.apache.org/jira/browse/SPARK-3267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Priority: Critical

 Deadlock here:
 {code}
 Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000]
    java.lang.Thread.State: RUNNABLE
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 ...
 {code}
 and
 {code}
 Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000]
    java.lang.Thread.State: RUNNABLE
 at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
 - locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
 at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at 

[jira] [Resolved] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names

2014-09-20 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3414.
-
Resolution: Fixed

Issue resolved by pull request 2382
[https://github.com/apache/spark/pull/2382]

 Case insensitivity breaks when unresolved relation contains attributes with 
 uppercase letters in their names
 

 Key: SPARK-3414
 URL: https://issues.apache.org/jira/browse/SPARK-3414
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Cheng Lian
Assignee: Michael Armbrust
Priority: Critical
 Fix For: 1.2.0


 Paste the following snippet into {{spark-shell}} (Hive support required) to 
 reproduce this issue:
 {code}
 import org.apache.spark.sql.hive.HiveContext
 val hiveContext = new HiveContext(sc)
 import hiveContext._
 case class LogEntry(filename: String, message: String)
 case class LogFile(name: String)
 sc.makeRDD(Seq.empty[LogEntry]).registerTempTable("rawLogs")
 sc.makeRDD(Seq.empty[LogFile]).registerTempTable("logFiles")
 val srdd = sql(
   """
 SELECT name, message
 FROM rawLogs
 JOIN (
   SELECT name
   FROM logFiles
 ) files
 ON rawLogs.filename = files.name
   """)
 srdd.registerTempTable("boom")
 sql("select * from boom")
 {code}
 Exception thrown:
 {code}
 SchemaRDD[7] at RDD at SchemaRDD.scala:103
 == Query Plan ==
 == Physical Plan ==
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree:
 Project [*]
  LowerCaseSchema
   Subquery boom
    Project ['name,'message]
     Join Inner, Some(('rawLogs.filename = name#2))
      LowerCaseSchema
       Subquery rawlogs
        SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
      Subquery files
       Project [name#2]
        LowerCaseSchema
         Subquery logfiles
          SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
 {code}
 Notice that {{rawLogs}} in the join operator is not lowercased.
 The reason is that, during the analysis phase, the 
 {{CaseInsensitiveAttributeReferences}} batch is executed only before the 
 {{Resolution}} batch. And when {{srdd}} is registered as the temporary table 
 {{boom}}, its original (unanalyzed) logical plan is stored in the catalog:
 {code}
 Join Inner, Some(('rawLogs.filename = 'files.name))
  UnresolvedRelation None, rawLogs, None
  Subquery files
   Project ['name]
    UnresolvedRelation None, logFiles, None
 {code}
 Notice that the attributes referenced in the join operator (esp. {{rawLogs}}) 
 are not lowercased yet.
 Then, when {{select * from boom}} is being analyzed, its input logical 
 plan is:
 {code}
 Project [*]
  UnresolvedRelation None, boom, None
 {code}
 Here the unresolved relation points to the unanalyzed logical plan of 
 {{srdd}} above, which is only discovered later by the {{ResolveRelations}} 
 rule, so it is never touched by {{CaseInsensitiveAttributeReferences}} and 
 {{rawLogs.filename}} is thus not lowercased:
 {code}
 === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
  Project [*]                            Project [*]
 ! UnresolvedRelation None, boom, None    LowerCaseSchema
 !                                         Subquery boom
 !                                          Project ['name,'message]
 !                                           Join Inner, Some(('rawLogs.filename = 'files.name))
 !                                            LowerCaseSchema
 !                                             Subquery rawlogs
 !                                              SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208)
 !                                            Subquery files
 !                                             Project ['name]
 !                                              LowerCaseSchema
 !                                               Subquery logfiles
 !                                                SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208)
 {code}
 A reasonable fix could be to always register the analyzed logical plan in 
 the catalog when registering temporary tables.
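 A minimal sketch of that idea (the names below are assumptions for 
 illustration, not the actual change from pull request 2382): analyze the plan 
 before it is stored, so the {{CaseInsensitiveAttributeReferences}} batch has 
 already rewritten attribute references by the time {{ResolveRelations}} 
 substitutes the plan into another query.
 {code}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

 // Hypothetical sketch only -- analyze/store are assumed helpers, not Spark's real API.
 trait TempTableRegistration {
   def analyze(plan: LogicalPlan): LogicalPlan      // assumed: runs all analyzer batches
   def store(name: String, plan: LogicalPlan): Unit // assumed: puts the plan in the catalog

   // Register the *analyzed* plan instead of the raw, unanalyzed one.
   def registerTempTable(name: String, plan: LogicalPlan): Unit =
     store(name, analyze(plan))
 }
 {code}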



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688

2014-09-20 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-3619:


 Summary: Upgrade to Mesos 0.21 to work around MESOS-1688
 Key: SPARK-3619
 URL: https://issues.apache.org/jira/browse/SPARK-3619
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Matei Zaharia


When Mesos 0.21 comes out, it will have a fix for 
https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3616) Add Selenium tests to Web UI

2014-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142244#comment-14142244
 ] 

Apache Spark commented on SPARK-3616:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2474

 Add Selenium tests to Web UI
 

 Key: SPARK-3616
 URL: https://issues.apache.org/jira/browse/SPARK-3616
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Josh Rosen
Assignee: Josh Rosen

 We should add basic Selenium tests to the Web UI suite. This will make it easy 
 to write regression tests / reproductions for UI bugs and will be useful for 
 testing some planned refactorings / redesigns that I'm working on.
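 A rough sketch of what such a test could look like (assumptions: ScalaTest's 
 Selenium DSL and an HtmlUnit driver on the test classpath, plus an application 
 UI already running on the default port; this is not the actual suite from the 
 pull request):
 {code}
 import org.openqa.selenium.WebDriver
 import org.openqa.selenium.htmlunit.HtmlUnitDriver
 import org.scalatest.FunSuite
 import org.scalatest.selenium.WebBrowser

 class WebUISmokeSuite extends FunSuite with WebBrowser {
   implicit val webDriver: WebDriver = new HtmlUnitDriver

   test("stages page loads and mentions Spark") {
     go to "http://localhost:4040/stages/"  // default port of a locally running application UI
     assert(pageTitle.toLowerCase.contains("spark"))
   }
 }
 {code}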



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142257#comment-14142257
 ] 

Patrick Wendell commented on SPARK-3604:


After looking at the PR - I think the issue here is just to use the built-in 
SparkContext.union utility when unioning large numbers of RDDs. I misread this 
originally as a bug report, but I think it's just a usage issue. Perhaps we 
should document this in the normal union() scaladoc.
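As a sketch of the usage difference (assuming a running SparkContext {{sc}}; the 
reporter hit this from PySpark, but the recursion is the same in Scala):
{code}
import org.apache.spark.rdd.RDD

val parts: Seq[RDD[Int]] = (1 to 5000).map(i => sc.parallelize(Seq(i)))

// Chaining union() nests one UnionRDD per input, so computing partitions
// recurses once per level and can overflow the stack for thousands of inputs:
val nested = parts.reduce(_ union _)

// SparkContext.union builds a single flat UnionRDD over all the inputs:
val flat = sc.union(parts)
flat.partitions.length
{code}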

 unbounded recursion in getNumPartitions triggers stack overflow for large 
 UnionRDD
 --

 Key: SPARK-3604
 URL: https://issues.apache.org/jira/browse/SPARK-3604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Linux. Used Python, but the error is in Scala land.
Reporter: Eric Friedman
Priority: Critical

 I have a large number of parquet files all with the same schema and attempted 
 to make a UnionRDD out of them.
 When I call getNumPartitions(), I get a stack overflow error
 that looks like this:
 Py4JJavaError: An error occurred while calling o3275.partitions.
 : java.lang.StackOverflowError
   at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD

2014-09-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142257#comment-14142257
 ] 

Patrick Wendell edited comment on SPARK-3604 at 9/21/14 12:35 AM:
--

After looking at the PR - I think the solution here is just to use the built-in 
SparkContext.union utility when unioning large numbers of RDDs. I misread this 
originally as a bug report, but I think it's just a usage issue. Perhaps we 
should document this in the normal union() scaladoc.


was (Author: pwendell):
After looking at the PR - I think the issue here is just to use the built-in 
SparkContext.union utility when unioning large numbers of RDDs. I misread this 
originally as a bug report, but I think it's just a usage issue. Perhaps we 
should document this in the normal union() scaladoc.

 unbounded recursion in getNumPartitions triggers stack overflow for large 
 UnionRDD
 --

 Key: SPARK-3604
 URL: https://issues.apache.org/jira/browse/SPARK-3604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: Linux. Used Python, but the error is in Scala land.
Reporter: Eric Friedman
Priority: Critical

 I have a large number of parquet files all with the same schema and attempted 
 to make a UnionRDD out of them.
 When I call getNumPartitions(), I get a stack overflow error
 that looks like this:
 Py4JJavaError: An error occurred while calling o3275.partitions.
 : java.lang.StackOverflowError
   at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
   at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65)
   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3620) Refactor parameter handling code for spark-submit

2014-09-20 Thread Dale Richardson (JIRA)
Dale Richardson created SPARK-3620:
--

 Summary: Refactor parameter handling code for spark-submit
 Key: SPARK-3620
 URL: https://issues.apache.org/jira/browse/SPARK-3620
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.1.0, 1.0.0
Reporter: Dale Richardson
Priority: Minor


I'm proposing it's time to refactor the configuration argument handling code in 
spark-submit. The code has grown organically in a short period of time, handles 
a pretty complicated logic flow, and is now pretty fragile. Some issues that 
have been identified:

1. Hand-crafted property file readers that do not support the property file 
format as specified in 
http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) 
(see the sketch after this list)

2. ResolveURI not called on paths read from conf/prop files

3. Inconsistent means of merging / overriding values from different sources 
(some get overridden by the file, others by manually setting a field on the 
object, some by properties)

4. Argument validation should be done after combining config files, system 
properties and command line arguments.

5. Alternate conf file location not handled in shell scripts

6. Some options can only be passed as command line arguments

7. Defaults for options are hard-coded (and sometimes overridden multiple 
times) in many places throughout the code, e.g. master = local[*]
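
For item 1, a minimal standalone sketch (the helper name is an assumption, not 
the actual SparkSubmit code) of delegating to java.util.Properties so the full 
format from the linked spec is supported:
{code}
import java.io.{File, FileInputStream, InputStreamReader}
import java.util.Properties
import scala.collection.JavaConverters._

// Assumed helper: read a spark-defaults.conf style file via Properties.load,
// which handles escapes, line continuations and unicode sequences per the spec.
def loadSparkProperties(file: File): Map[String, String] = {
  val reader = new InputStreamReader(new FileInputStream(file), "UTF-8")
  try {
    val props = new Properties()
    props.load(reader)
    props.stringPropertyNames().asScala.map(k => k -> props.getProperty(k)).toMap
  } finally {
    reader.close()
  }
}
{code}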



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3620) Refactor config option handling code for spark-submit

2014-09-20 Thread Dale Richardson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dale Richardson updated SPARK-3620:
---
Summary: Refactor config option handling code for spark-submit  (was: 
Refactor parameter handling code for spark-submit)

 Refactor config option handling code for spark-submit
 -

 Key: SPARK-3620
 URL: https://issues.apache.org/jira/browse/SPARK-3620
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0, 1.1.0
Reporter: Dale Richardson
Assignee: Dale Richardson
Priority: Minor

 I'm proposing it's time to refactor the configuration argument handling code 
 in spark-submit. The code has grown organically in a short period of time, 
 handles a pretty complicated logic flow, and is now pretty fragile. Some 
 issues that have been identified:
 1. Hand-crafted property file readers that do not support the property file 
 format as specified in 
 http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader)
 2. ResolveURI not called on paths read from conf/prop files
 3. Inconsistent means of merging / overriding values from different sources 
 (some get overridden by the file, others by manually setting a field on the 
 object, some by properties)
 4. Argument validation should be done after combining config files, system 
 properties and command line arguments.
 5. Alternate conf file location not handled in shell scripts
 6. Some options can only be passed as command line arguments
 7. Defaults for options are hard-coded (and sometimes overridden multiple 
 times) in many places throughout the code, e.g. master = local[*]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3247) Improved support for external data sources

2014-09-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142277#comment-14142277
 ] 

Apache Spark commented on SPARK-3247:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/2475

 Improved support for external data sources
 --

 Key: SPARK-3247
 URL: https://issues.apache.org/jira/browse/SPARK-3247
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3599) Avoid loading and printing properties file content frequently

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3599.

Resolution: Fixed
  Assignee: WangTaoTheTonic

 Avoid loading and printing properties file content frequently
 -

 Key: SPARK-3599
 URL: https://issues.apache.org/jira/browse/SPARK-3599
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Assignee: WangTaoTheTonic
Priority: Minor
 Attachments: too many verbose.txt


 When I use -v | --verbose with spark-submit, it prints lots of messages about 
 the contents of the properties file. 
 After checking the code in SparkSubmit.scala and SparkSubmitArguments.scala, I 
 found that the getDefaultSparkProperties method is invoked in three places, and 
 every time we invoke it we load properties from the properties file and print 
 them again if the -v option is used.
 We should probably use a value instead of a method for the default properties.
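 A minimal sketch of that change (hypothetical names, not the real 
 SparkSubmitArguments code): compute the defaults once and reuse them, instead 
 of reloading and re-printing the file from each of the three call sites.
 {code}
 class SparkSubmitDefaults(load: () => Map[String, String], verbose: Boolean) {
   // `lazy val` memoizes the first call; a `def` would re-run load() every time.
   lazy val defaultSparkProperties: Map[String, String] = {
     val props = load()
     if (verbose) props.foreach { case (k, v) => println(s"Default property: $k=$v") }
     props
   }
 }
 {code}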



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1966) Cannot cancel tasks running locally

2014-09-20 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142334#comment-14142334
 ] 

Josh Rosen commented on SPARK-1966:
---

Actually, scratch that; it wasn't an issue since local execution of tasks is 
now disabled by default.  But I think this bug still exists.

 Cannot cancel tasks running locally
 ---

 Key: SPARK-1966
 URL: https://issues.apache.org/jira/browse/SPARK-1966
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
Reporter: Aaron Davidson





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1597) Add a version of reduceByKey that takes the Partitioner as a second argument

2014-09-20 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142335#comment-14142335
 ] 

Patrick Wendell commented on SPARK-1597:


See relevant comment here:
https://github.com/apache/spark/pull/550#issuecomment-41483260

 Add a version of reduceByKey that takes the Partitioner as a second argument
 

 Key: SPARK-1597
 URL: https://issues.apache.org/jira/browse/SPARK-1597
 Project: Spark
  Issue Type: Bug
Reporter: Matei Zaharia
Assignee: Sandeep Singh
Priority: Blocker

 Most of our shuffle methods can take a Partitioner or a number of partitions 
 as a second argument, but for some reason reduceByKey takes the Partitioner 
 as a *first* argument: 
 http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions.
  We should deprecate that version and add one where the Partitioner is the 
 second argument.
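 A sketch of the inconsistency (assumes a running SparkContext {{sc}}):
 {code}
 import org.apache.spark.HashPartitioner

 val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

 pairs.reduceByKey(_ + _, 4)                       // a partition count goes second...
 pairs.reduceByKey(new HashPartitioner(4), _ + _)  // ...but a Partitioner must go first

 // The proposed (not yet existing) overload would read:
 // pairs.reduceByKey(_ + _, new HashPartitioner(4))
 {code}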



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access

2014-09-20 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created SPARK-3621:
--

 Summary: Provide a way to broadcast an RDD (instead of just a 
variable made of the RDD) so that a job can access
 Key: SPARK-3621
 URL: https://issues.apache.org/jira/browse/SPARK-3621
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0, 1.0.0
Reporter: Xuefu Zhang


In some cases, such as Hive's way of doing a map-side join, it would be beneficial 
to allow the client program to broadcast RDDs rather than just variables made of 
these RDDs. Broadcasting a variable made of RDDs requires all the RDD data to be 
collected to the driver and the resulting variable to be shipped to the cluster 
after it is built. It would perform better if the driver just broadcast the RDDs 
and the corresponding data were used in jobs (such as building hashmaps at executors).

Tez has a broadcast edge that can ship data from the previous stage to the next, 
without requiring driver-side processing.
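
As a sketch of the workaround this issue wants to avoid (assuming a running 
SparkContext sc): the small RDD is collected to the driver, turned into a map, 
and only then broadcast for a map-side join.
{code}
val small = sc.parallelize(Seq(("a", 1), ("b", 2)))
val big   = sc.parallelize(Seq(("a", "x"), ("c", "y")))

val smallAsMap = small.collect().toMap   // all data funneled through the driver
val bcast      = sc.broadcast(smallAsMap) // then shipped back out to every executor

val joined = big.mapPartitions { iter =>
  val lookup = bcast.value               // executor-side hashmap for the join
  iter.flatMap { case (k, v) => lookup.get(k).map(n => (k, (v, n))) }
}
joined.collect()
{code}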



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2014-09-20 Thread Xuefu Zhang (JIRA)
Xuefu Zhang created SPARK-3622:
--

 Summary: Provide a custom transformation that can output multiple 
RDDs
 Key: SPARK-3622
 URL: https://issues.apache.org/jira/browse/SPARK-3622
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xuefu Zhang


All existing transformations return at most one RDD, even those that take 
user-supplied functions such as mapPartitions(). However, sometimes a 
user-provided function may need to output multiple RDDs, for instance a filter 
function that divides the input RDD into several RDDs. While it's possible to 
get multiple RDDs by transforming the same RDD multiple times, it may be more 
efficient to do this concurrently in one shot, especially when the user's 
existing function is already generating the different data sets.

This is the case in Hive on Spark, where Hive's map and reduce functions can 
output different data sets to be consumed by subsequent stages.
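
A sketch of the status quo described above (assuming a running SparkContext sc): 
getting several RDDs out of one input means filtering the same cached parent 
more than once, rather than emitting all outputs in a single pass.
{code}
val events = sc.parallelize(Seq(("error", "disk full"), ("info", "started"))).cache()

val errors = events.filter { case (level, _) => level == "error" } // pass 1 over the data
val infos  = events.filter { case (level, _) => level == "info" }  // pass 2 over the same data
(errors.count(), infos.count())
{code}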



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2630:
---
Target Version/s: 1.2.0  (was: 1.1.0)

 Input data size of CoalescedRDD is incorrect
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Priority: Critical
 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, put it into one task with 
 sc.textFile("text.4.3.G").coalesce(1).count(). In the Spark Web UI, you will 
 see that the input size is reported as 5.4M. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2630:
---
Assignee: Andrew Ash

 Input data size of CoalescedRDD is incorrect
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Assignee: Andrew Ash
Priority: Blocker
 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, put it into one task with 
 sc.textFile("text.4.3.G").coalesce(1).count(). In the Spark Web UI, you will 
 see that the input size is reported as 5.4M. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2014-09-20 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2630:
---
Priority: Blocker  (was: Critical)

 Input data size of CoalescedRDD is incorrect
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Priority: Blocker
 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, put it into one task with 
 sc.textFile("text.4.3.G").coalesce(1).count(). In the Spark Web UI, you will 
 see that the input size is reported as 5.4M. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org