[jira] [Commented] (SPARK-19103) In web ui,URL's host name should be a specific IP address.

2017-01-06 Thread guoxiaolong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807033#comment-15807033
 ] 

guoxiaolong commented on SPARK-19103:
-

Because opening this address in the browser returns a 404, we must configure the 
domain name and IP mapping in the hosts file. Please see the attachment 3.png.
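For example, an entry of the following form in the hosts file (the IP address and host 
name below are placeholders, not taken from the report) lets the browser resolve the 
host name used in the UI links:
{code}
# /etc/hosts (illustrative values only)
192.168.1.10   spark-master-hostname
{code}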

> In web ui,URL's host name should be a specific IP address.
> --
>
> Key: SPARK-19103
> URL: https://issues.apache.org/jira/browse/SPARK-19103
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
> Environment: spark 2.0.2
>Reporter: guoxiaolong
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> In the web UI, the URL's host name should be a specific IP address. Opening the URL 
> requires resolving the host name; when the host name cannot be resolved, the URL 
> cannot be reached. Please see the attachment. Thank you!






[jira] [Created] (SPARK-19115) SparkSQL unsupported the command " create external table if not exist new_tbl like old_tbl"

2017-01-06 Thread Xiaochen Ouyang (JIRA)
Xiaochen Ouyang created SPARK-19115:
---

 Summary: SparkSQL unsupported the command " create external table 
if not exist  new_tbl like old_tbl"
 Key: SPARK-19115
 URL: https://issues.apache.org/jira/browse/SPARK-19115
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
 Environment: spark2.0.1 hive1.2.1
Reporter: Xiaochen Ouyang


Spark 2.0.1 does not support the command "create external table if not exist 
new_tbl like old_tbl".
We tried to modify the SqlBase.g4 file, changing
"| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
to
"| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier
LIKE source=tableIdentifier
#createTableLike"
and then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". 
After that, we could run the command "create external table if not exist 
new_tbl like old_tbl" successfully; unfortunately, the generated table's 
type in the metastore database is MANAGED_TABLE instead of EXTERNAL_TABLE.
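A minimal sketch of how to check the resulting table type from a spark-shell session 
once the grammar patch above makes the statement parse; the table names and the use of 
spark.catalog here are only illustrative, not taken from the original report:
{code}
// Sketch only: create the table via the patched parser and check the type
// the metastore recorded. "old_tbl" / "new_tbl" are placeholder names.
spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS new_tbl LIKE old_tbl")

// spark.catalog exposes the table type stored in the metastore; the report
// above observed a managed table here rather than an external one.
spark.catalog.listTables()
  .filter("name = 'new_tbl'")
  .select("name", "tableType")
  .show(false)
{code}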






[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-01-06 Thread jin xing (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806897#comment-15806897
 ] 

jin xing commented on SPARK-18113:
--

[~xq2005], [~aash]
I am seeing this issue in my cluster sometimes. When 
*OutputCommitCoordinatorEndpoint* receives *AskPermissionToCommitOutput* for 
the first time, it will mark the task attempt 
as the committer in *authorizedCommittersByStage* and send back the response. But 
if the worker fails to get the response within *spark.rpc.timeout*, it will retry 
sending *AskPermissionToCommitOutput*. However, the retry will be denied by 
*OutputCommitCoordinatorEndpoint*, because a committer has already been registered 
for the partition, even though the registered committer and the retrying 
worker are the same. 
Reproducing is easy:
{code:title=OutputCommitCoordinator.scala|borderStyle=solid}
..
  // Marked private[scheduler] instead of private so this can be mocked in tests
  private[scheduler] def handleAskPermissionToCommit(
      stage: StageId,
      partition: PartitionId,
      attemptNumber: TaskAttemptNumber): Boolean = synchronized {
    authorizedCommittersByStage.get(stage) match {
      case Some(authorizedCommitters) =>
        authorizedCommitters(partition) match {
          case NO_AUTHORIZED_COMMITTER =>
            logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
              s"partition=$partition")
            authorizedCommitters(partition) = attemptNumber
            // Sleep for 150 seconds, longer than spark.rpc.askTimeout (120 seconds),
            // so the worker times out and retries AskPermissionToCommitOutput.
            Thread.sleep(150 * 1000)
            true
          case existingCommitter =>
            logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
              s"partition=$partition; existingCommitter = $existingCommitter")
            false
        }
      case None =>
        logDebug(s"Stage $stage has completed, so not allowing attempt number $attemptNumber " +
          s"of partition $partition to commit")
        false
    }
  }
..
{code}
When the worker asks to be registered as a committer for the first time, we sleep 150 
seconds, which is longer than *spark.rpc.timeout=120 seconds*. When the worker 
retries *AskPermissionToCommitOutput*, it will get *CommitDeniedException*, and 
the task will fail with reason *TaskCommitDenied*, which is not regarded as a 
task failure (SPARK-11178), so the TaskScheduler will keep rescheduling this task indefinitely.

[~xq2005]
If you don't have time, could I make a PR for this?

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>
> The executor's attempt to send *AskPermissionToCommitOutput* to the driver fails, so 
> it retries the send. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles both. But the executor ignores the first response (true) and receives the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is locked forever, and the driver enters an infinite 
> loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> 

[jira] [Closed] (SPARK-18929) Add Tweedie distribution in GLM

2017-01-06 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-18929.
---
Resolution: Unresolved

> Add Tweedie distribution in GLM
> ---
>
> Key: SPARK-18929
> URL: https://issues.apache.org/jira/browse/SPARK-18929
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I propose to add the full Tweedie family into the GeneralizedLinearRegression 
> model. The Tweedie family is characterized by a power variance function. 
> Currently supported distributions such as the Gaussian, Poisson and Gamma 
> families are special cases of the 
> [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution] family. 
> I propose to add support for the other distributions:
> * compound Poisson: 1 < variancePower < 2. This one is widely used to model 
> zero-inflated continuous distributions. 
> * positive stable: variancePower > 2 and variancePower != 3. Used to model 
> extreme values.
> * inverse Gaussian: variancePower = 3.
>  The Tweedie family is supported in most statistical packages, such as R 
> (statmod), SAS, h2o, etc. 
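A rough sketch of what such an API could look like on top of GeneralizedLinearRegression; 
the "tweedie" family name and the setVariancePower/setLinkPower setters are assumptions 
for illustration only, not an existing API at the time of this issue:
{code}
// Hypothetical sketch only: "tweedie", setVariancePower and setLinkPower are
// assumed additions to GeneralizedLinearRegression, not existing API here.
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")     // assumed new family name
  .setVariancePower(1.6)    // 1 < p < 2: compound Poisson, zero-inflated continuous data
  .setLinkPower(0.0)        // assumed power-link parameterization (0.0 = log link)
  .setMaxIter(50)

// `training` is assumed to be a DataFrame with the usual "label"/"features" columns.
val model = glr.fit(training)
{code}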






[jira] [Commented] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)

2017-01-06 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806774#comment-15806774
 ] 

Shivaram Venkataraman commented on SPARK-19108:
---

+1 - This is a good idea. One thing I'd like to add is that it might be better 
to create one broadcast rather than two for the sake of efficiency. For 
each broadcast variable we contact the driver to get location information and 
then initiate some fetches, so keeping a single broadcast variable keeps the 
number of messages lower.
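A rough sketch of that suggestion, reusing the class names from the description quoted 
below; ExecutorTaskSetBundle and its fields are illustrative assumptions, not existing 
Spark classes:
{code}
import org.apache.spark.broadcast.Broadcast

// Illustrative only: put everything shared by the stage behind ONE broadcast, so
// each executor asks the driver for broadcast locations once per task set.
class ExecutorTaskSetBundle(
  val taskSetMeta: ExecutorTaskSetMeta,  // stage-level metadata (see quoted sketch below)
  val taskBinary: Array[Byte])           // serialized task binary shared by all tasks

class ExecutorTaskData(
  val partitionId: Int,
  val attemptNumber: Int,
  val taskId: Long,
  val bundle: Broadcast[ExecutorTaskSetBundle])  // single broadcast per task set
{code}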

> Broadcast all shared parts of tasks (to reduce task serialization time)
> ---
>
> Key: SPARK-19108
> URL: https://issues.apache.org/jira/browse/SPARK-19108
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Kay Ousterhout
>
> Expand the amount of information that's broadcasted for tasks, to avoid 
> serializing data per-task that should only be sent to each executor once for 
> the entire stage.
> Conceptually, this means we'd have new classes specifically for sending the 
> minimal necessary data to the executor, like:
> {code}
> /**
>   * metadata about the taskset needed by the executor for all tasks in this 
> taskset.  Subset of the
>   * full data kept on the driver to make it faster to serialize and send to 
> executors.
>   */
> class ExecutorTaskSetMeta(
>   val stageId: Int,
>   val stageAttemptId: Int,
>   val properties: Properties,
>   val addedFiles: Map[String, String],
>   val addedJars: Map[String, String]
>   // maybe task metrics here?
> )
> class ExecutorTaskData(
>   val partitionId: Int,
>   val attemptNumber: Int,
>   val taskId: Long,
>   val taskBinary: Broadcast[Array[Byte]],
>   val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
> )
> {code}
> Then all the info you'd need to send to the executors would be a serialized 
> version of ExecutorTaskData.  Furthermore, given the simplicity of that 
> class, you could serialize manually, and then for each task you could just 
> modify the first two ints & one long directly in the byte buffer.  (You could 
> do the same trick for serialization even if ExecutorTaskSetMeta was not a 
> broadcast, but that will keep the msgs small as well.)
> There are a bunch of details I'm skipping here: you'd also need to do some 
> special handling for the TaskMetrics; the way tasks get started in the 
> executor would change; you'd also need to refactor {{Task}} to let it get 
> reconstructed from this information (or add more to ExecutorTaskSetMeta); and 
> probably other details I'm overlooking now.
> (this is copied from SPARK-18890 and [~imranr]'s comment there; cc 
> [~shivaram])






[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19110:
--
Shepherd: Joseph K. Bradley
Target Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Miao Wang
>
> While adding the DistributedLDAModel training summary for SparkR, I found that 
> the logPrior values of the original and loaded models are different.
> For example, in the test("read/write DistributedLDAModel"), I add the following check:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
> val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
> assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573






[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19110:
--
Affects Version/s: 2.2.0
   1.3.1
   1.4.1
   1.5.2
   1.6.3
   2.0.2
   2.1.0

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Miao Wang
>
> While adding the DistributedLDAModel training summary for SparkR, I found that 
> the logPrior values of the original and loaded models are different.
> For example, in the test("read/write DistributedLDAModel"), I add the following check:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
> val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
> assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573






[jira] [Resolved] (SPARK-18194) Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit

2017-01-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18194.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16480
[https://github.com/apache/spark/pull/16480]

> Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit
> --
>
> Key: SPARK-18194
> URL: https://issues.apache.org/jira/browse/SPARK-18194
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Sue Ann Hong
> Fix For: 2.2.0
>
>
> Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit






[jira] [Commented] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806575#comment-15806575
 ] 

Apache Spark commented on SPARK-16920:
--

User 'mhmoudr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16495

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.






[jira] [Assigned] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16920:


Assignee: (was: Apache Spark)

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.






[jira] [Assigned] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16920:


Assignee: Apache Spark

> Investigate and fix issues introduced in SPARK-15858
> 
>
> Key: SPARK-16920
> URL: https://issues.apache.org/jira/browse/SPARK-16920
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Vladimir Feinberg
>Assignee: Apache Spark
>
> There were several issues regarding the PR resolving SPARK-15858, my comments 
> are available here:
> https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93
> The two most important issues are:
> 1. The PR did not add a stress test proving it resolved the issue it was 
> supposed to (though I have no doubt the optimization made is indeed correct).
> 2. The PR introduced quadratic prediction time in terms of the number of 
> trees, which was previously linear. This issue needs to be investigated for 
> whether it causes problems for large numbers of trees (say, 1000), an 
> appropriate test should be added, and then fixed.






[jira] [Created] (SPARK-19114) Backpressure rate is cast from double to long to double

2017-01-06 Thread Tony Novak (JIRA)
Tony Novak created SPARK-19114:
--

 Summary: Backpressure rate is cast from double to long to double
 Key: SPARK-19114
 URL: https://issues.apache.org/jira/browse/SPARK-19114
 Project: Spark
  Issue Type: Bug
Reporter: Tony Novak


We have a Spark streaming job where each record takes well over a second to 
execute, so the stable rate is under 1 element/second. We set 
spark.streaming.backpressure.enabled=true and 
spark.streaming.backpressure.pid.minRate=0.1, but backpressure did not appear 
to be effective, even though the TRACE level logs from PIDRateEstimator showed 
that the new rate was 0.1.

As it turns out, even though the minRate parameter is a Double, and the rate 
estimate generated by PIDRateEstimator is a Double as well, RateController 
casts the new rate to a Long. As a result, if the computed rate is less than 1, 
it's truncated to 0, which ends up being interpreted as "no limit".

What's particularly confusing is that the Guava RateLimiter class takes a rate 
limit as a double, so the long value ends up being cast back to a double.

Is there any reason not to keep the rate limit as a double all the way through? 
I'm happy to create a pull request if this makes sense.

We encountered the bug on Spark 1.6.2, but it looks like the code in the master 
branch is still affected.
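A small illustration of the cast chain described above (variable names are made up; 
this is not Spark's actual code path):
{code}
// Illustration only of the double -> long -> double truncation described above.
val estimatedRate: Double = 0.1                      // rate produced by PIDRateEstimator
val publishedRate: Long = estimatedRate.toLong       // RateController keeps a Long => 0
val effectiveLimit: Double = publishedRate.toDouble  // handed to the rate limiter as 0.0
// A limit of 0 is interpreted as "no limit", so backpressure is effectively disabled.
{code}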






[jira] [Comment Edited] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee

2017-01-06 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804425#comment-15804425
 ] 

Gaurav Shah edited comment on SPARK-9215 at 1/7/17 2:04 AM:


[~tdas] I know this is an old pull request but was still wondering if you can 
help. I was wondering can we enhance this to make sure that we checkpoint only 
after blocks of data has been written. So we need not implement Spark 
checkpoint in the first place. Each block has a start and end seq number.


was (Author: gaurav24):
[~tdas] I know this is an old pull request but was still wondering if you can 
help. I was wondering can we enhance this to make sure that we checkpoint only 
after blocks of data has been written. So we need to implement Spark checkpoint 
in the first place. Each block has a start and end seq number.

> Implement WAL-free Kinesis receiver that give at-least once guarantee
> -
>
> Key: SPARK-9215
> URL: https://issues.apache.org/jira/browse/SPARK-9215
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.5.0
>
>
> Currently, the KinesisReceiver can loose some data in the case of certain 
> failures (receiver and driver failures). Using the write ahead logs can 
> mitigate some of the problem, but it is not ideal because WALs dont work with 
> S3 (eventually consistency, etc.) which is the most likely file system to be 
> used in the EC2 environment. Hence, we have to take a different approach to 
> improving reliability for Kinesis.
> Detailed design doc - 
> https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing






[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2017-01-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18372:

Assignee: mingjie tang

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
>Assignee: mingjie tang
> Fix For: 1.6.4
>
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restarting the 
> components:
> hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py






[jira] [Resolved] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up

2017-01-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18372.
-
   Resolution: Resolved
Fix Version/s: 1.6.4

> .Hive-staging folders created from Spark hiveContext are not getting cleaned 
> up
> ---
>
> Key: SPARK-18372
> URL: https://issues.apache.org/jira/browse/SPARK-18372
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 1.6.3
> Environment: spark standalone and spark yarn 
>Reporter: mingjie tang
> Fix For: 1.6.4
>
> Attachments: _thumb_37664.png
>
>
> Steps to reproduce:
> 
> 1. Launch spark-shell 
> 2. Run the following scala code via Spark-Shell 
> scala> val hivesampletabledf = sqlContext.table("hivesampletable") 
> scala> import org.apache.spark.sql.DataFrameWriter 
> scala> val dfw : DataFrameWriter = hivesampletabledf.write 
> scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( 
> clientid string, querytime string, market string, deviceplatform string, 
> devicemake string, devicemodel string, state string, country string, 
> querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") 
> scala> dfw.insertInto("hivesampletablecopypy") 
> scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, 
> querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE 
> state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """)
> hivesampletablecopypydfdf.show
> 3. in HDFS (in our case, WASB), we can see the following folders 
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1
>  
> hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693
> the issue is that these don't get cleaned up and get accumulated
> =
> with the customer, we have tried setting "SET 
> hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - didn't make any 
> difference.
> .hive-staging folders are created under the  folder - 
> hive/warehouse/hivesampletablecopypy/
> we have tried adding this property to hive-site.xml and restarting the 
> components:
> hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging
> a new .hive-staging folder was created in hive/warehouse/ folder
> moreover, please understand that if we run the hive query in pure Hive via 
> Hive CLI on the same Spark cluster, we don't see the behavior
> so it doesn't appear to be a Hive issue/behavior in this case- this is a 
> spark behavior
> I checked in Ambari, spark.yarn.preserve.staging.files=false in Spark 
> configuration already
> The issue happens via Spark-submit as well - customer used the following 
> command to reproduce this -
> spark-submit test-hive-staging-cleanup.py






[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-01-06 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806368#comment-15806368
 ] 

Ilya Matiach commented on SPARK-17975:
--

I was able to reproduce the issue with your dataset and have made the 
suggested fix in the pull request. I also added a test case, based on a dataset 
similar to yours, that reproduces the error. Thank you!

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}




[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17975:


Assignee: (was: Apache Spark)

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}






[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17975:


Assignee: Apache Spark

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
>Assignee: Apache Spark
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}




[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806365#comment-15806365
 ] 

Apache Spark commented on SPARK-17975:
--

User 'imatiach-msft' has created a pull request for this issue:
https://github.com/apache/spark/pull/16494

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes 
> RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}




[jira] [Assigned] (SPARK-19093) Cached tables are not used in SubqueryExpression

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19093:


Assignee: (was: Apache Spark)

> Cached tables are not used in SubqueryExpression
> 
>
> Key: SPARK-19093
> URL: https://issues.apache.org/jira/browse/SPARK-19093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Josh Rosen
>
> See reproduction at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html
> Consider the following:
> {code}
> Seq(("a", "b"), ("c", "d"))
>   .toDS
>   .write
>   .parquet("/tmp/rows")
> val df = spark.read.parquet("/tmp/rows")
> df.cache()
> df.count()
> df.createOrReplaceTempView("rows")
> spark.sql("""
>   select * from rows cross join rows
> """).explain(true)
> spark.sql("""
>   select * from rows where not exists (select * from rows)
> """).explain(true)
> {code}
> In both plans, I'd expect that both sides of the joins would read from the 
> cached table for both the cross join and anti join, but the left anti join 
> produces the following plan which only reads the left side from cache and 
> reads the right side via a regular non-cached scan:
> {code}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT exists#3994
>:  +- 'Project [*]
>: +- 'UnresolvedRelation `rows`
>+- 'UnresolvedRelation `rows`
> == Analyzed Logical Plan ==
> _1: string, _2: string
> Project [_1#3775, _2#3776]
> +- Filter NOT predicate-subquery#3994 []
>:  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>: +- Project [_1#3775, _2#3776]
>:+- SubqueryAlias rows
>:   +- Relation[_1#3775,_2#3776] parquet
>+- SubqueryAlias rows
>   +- Relation[_1#3775,_2#3776] parquet
> == Optimized Logical Plan ==
> Join LeftAnti
> :- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>+- Relation[_1#3775,_2#3776] parquet
> == Physical Plan ==
> BroadcastNestedLoopJoin BuildRight, LeftAnti
> :- InMemoryTableScan [_1#3775, _2#3776]
> : +- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- BroadcastExchange IdentityBroadcastMode
>+- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> {code}






[jira] [Commented] (SPARK-19093) Cached tables are not used in SubqueryExpression

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806285#comment-15806285
 ] 

Apache Spark commented on SPARK-19093:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/16493

> Cached tables are not used in SubqueryExpression
> 
>
> Key: SPARK-19093
> URL: https://issues.apache.org/jira/browse/SPARK-19093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Josh Rosen
>
> See reproduction at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html
> Consider the following:
> {code}
> Seq(("a", "b"), ("c", "d"))
>   .toDS
>   .write
>   .parquet("/tmp/rows")
> val df = spark.read.parquet("/tmp/rows")
> df.cache()
> df.count()
> df.createOrReplaceTempView("rows")
> spark.sql("""
>   select * from rows cross join rows
> """).explain(true)
> spark.sql("""
>   select * from rows where not exists (select * from rows)
> """).explain(true)
> {code}
> In both plans, I'd expect that both sides of the joins would read from the 
> cached table for both the cross join and anti join, but the left anti join 
> produces the following plan which only reads the left side from cache and 
> reads the right side via a regular non-cached scan:
> {code}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT exists#3994
>:  +- 'Project [*]
>: +- 'UnresolvedRelation `rows`
>+- 'UnresolvedRelation `rows`
> == Analyzed Logical Plan ==
> _1: string, _2: string
> Project [_1#3775, _2#3776]
> +- Filter NOT predicate-subquery#3994 []
>:  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>: +- Project [_1#3775, _2#3776]
>:+- SubqueryAlias rows
>:   +- Relation[_1#3775,_2#3776] parquet
>+- SubqueryAlias rows
>   +- Relation[_1#3775,_2#3776] parquet
> == Optimized Logical Plan ==
> Join LeftAnti
> :- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>+- Relation[_1#3775,_2#3776] parquet
> == Physical Plan ==
> BroadcastNestedLoopJoin BuildRight, LeftAnti
> :- InMemoryTableScan [_1#3775, _2#3776]
> : +- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- BroadcastExchange IdentityBroadcastMode
>+- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> {code}






[jira] [Assigned] (SPARK-19093) Cached tables are not used in SubqueryExpression

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19093:


Assignee: Apache Spark

> Cached tables are not used in SubqueryExpression
> 
>
> Key: SPARK-19093
> URL: https://issues.apache.org/jira/browse/SPARK-19093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> See reproduction at 
> https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html
> Consider the following:
> {code}
> Seq(("a", "b"), ("c", "d"))
>   .toDS
>   .write
>   .parquet("/tmp/rows")
> val df = spark.read.parquet("/tmp/rows")
> df.cache()
> df.count()
> df.createOrReplaceTempView("rows")
> spark.sql("""
>   select * from rows cross join rows
> """).explain(true)
> spark.sql("""
>   select * from rows where not exists (select * from rows)
> """).explain(true)
> {code}
> In both plans, I'd expect that both sides of the joins would read from the 
> cached table for both the cross join and anti join, but the left anti join 
> produces the following plan which only reads the left side from cache and 
> reads the right side via a regular non-cached scan:
> {code}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT exists#3994
>:  +- 'Project [*]
>: +- 'UnresolvedRelation `rows`
>+- 'UnresolvedRelation `rows`
> == Analyzed Logical Plan ==
> _1: string, _2: string
> Project [_1#3775, _2#3776]
> +- Filter NOT predicate-subquery#3994 []
>:  +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>: +- Project [_1#3775, _2#3776]
>:+- SubqueryAlias rows
>:   +- Relation[_1#3775,_2#3776] parquet
>+- SubqueryAlias rows
>   +- Relation[_1#3775,_2#3776] parquet
> == Optimized Logical Plan ==
> Join LeftAnti
> :- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>+- Relation[_1#3775,_2#3776] parquet
> == Physical Plan ==
> BroadcastNestedLoopJoin BuildRight, LeftAnti
> :- InMemoryTableScan [_1#3775, _2#3776]
> : +- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, 
> memory, deserialized, 1 replicas)
> :   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> +- BroadcastExchange IdentityBroadcastMode
>+- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
>   +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2017-01-06 Thread Randall Whitman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806218#comment-15806218
 ] 

Randall Whitman commented on SPARK-7768:


Would you mind commenting on how the updated UDT works with a class from an 
unmodified third-party library?  Thanks in advance.
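
For context, here is a rough sketch of what a UDT for a class from an unmodified 
third-party library might look like, assuming the currently-internal 
UserDefinedType / UDTRegistration shapes were simply made public; 
fastmath.Point2D below is a hypothetical stand-in for such a class:

{code}
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// Maps the third-party Point2D to an array<double> column.
class Point2DUDT extends UserDefinedType[fastmath.Point2D] {

  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: fastmath.Point2D): Any =
    new GenericArrayData(Array[Any](p.getX, p.getY))

  override def deserialize(datum: Any): fastmath.Point2D = datum match {
    case a: ArrayData => new fastmath.Point2D(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[fastmath.Point2D] = classOf[fastmath.Point2D]
}

// Registration by class name, since an unmodified third-party class cannot
// carry the @SQLUserDefinedType annotation itself.
UDTRegistration.register(classOf[fastmath.Point2D].getName, classOf[Point2DUDT].getName)
{code}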

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7768) Make user-defined type (UDT) API public

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7768:
---

Assignee: Apache Spark

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7768) Make user-defined type (UDT) API public

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7768:
---

Assignee: (was: Apache Spark)

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806167#comment-15806167
 ] 

Apache Spark commented on SPARK-7768:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16478

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19111) S3 Mesos history upload fails if too large

2017-01-06 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-19111:
--
Summary: S3 Mesos history upload fails if too large  (was: S3 Mesos history 
upload fails if too large or if distributed datastore is misbehaving)

> S3 Mesos history upload fails if too large
> --
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] 
> 

[jira] [Assigned] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19113:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806085#comment-15806085
 ] 

Apache Spark commented on SPARK-19113:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16492

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19113:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source 
> should be sent to the user
> -
>
> Key: SPARK-19113
> URL: https://issues.apache.org/jira/browse/SPARK-19113
> Project: Spark
>  Issue Type: Test
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19111) S3 Mesos history upload fails silently if too large

2017-01-06 Thread Charles Allen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Allen updated SPARK-19111:
--
Summary: S3 Mesos history upload fails silently if too large  (was: S3 
Mesos history upload fails if too large)

> S3 Mesos history upload fails silently if too large
> ---
>
> Key: SPARK-19111
> URL: https://issues.apache.org/jira/browse/SPARK-19111
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Mesos, Spark Core
>Affects Versions: 2.0.0
>Reporter: Charles Allen
>
> {code}
> 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped 
> Spark web UI at http://REDACTED:4041
> 2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.jvmGCTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBlocksFetched
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSerializationTime
> 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(
> 364,WrappedArray())
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.resultSize
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.peakExecutionMemory
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.fetchWaitTime
> 2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.memoryBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.diskBytesSpilled
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.localBytesRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.recordsRead
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorDeserializeTime
> 2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.executorRunTime
> 2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
> com.metamx.starfire.spark.SparkDriver - emitting metric: 
> internal.metrics.shuffle.read.remoteBlocksFetched
> 2017-01-06T21:32:32,943 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
> 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
> 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
> org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
> stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
> {code}
> Running spark on mesos, some large jobs fail to upload to the history server 
> storage!
> A successful sequence of events in the log that yield an upload are as 
> follows:
> {code}
> 2017-01-06T19:14:32,925 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
> 2017-01-06T21:59:14,789 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
> 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
> closed. Now beginning upload
> 2017-01-06T21:59:44,679 INFO [main] 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem 

[jira] [Created] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user

2017-01-06 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19113:


 Summary: Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal 
errors from a source should be sent to the user
 Key: SPARK-19113
 URL: https://issues.apache.org/jira/browse/SPARK-19113
 Project: Spark
  Issue Type: Test
  Components: Structured Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19112) add codec for ZStandard

2017-01-06 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-19112:
-

 Summary: add codec for ZStandard
 Key: SPARK-19112
 URL: https://issues.apache.org/jira/browse/SPARK-19112
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Thomas Graves


ZStandard (https://github.com/facebook/zstd and http://facebook.github.io/zstd/) 
has been in use for a while now. v1.0 was recently released. Hadoop 
(https://issues.apache.org/jira/browse/HADOOP-13578) and others 
(https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it.

Zstd seems to give great results: roughly Gzip-level compression at LZ4-level CPU 
cost.
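
As a rough sketch (not a finished implementation), such a codec could plug into 
Spark's CompressionCodec developer API via the zstd-jni bindings; the 
com.github.luben:zstd-jni dependency and the spark.io.compression.zstd.level key 
below are assumptions:

{code}
import java.io.{InputStream, OutputStream}

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}

import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Level 1 aims for LZ4-like CPU cost; higher levels trade CPU for better ratio.
class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {

  private val level = conf.getInt("spark.io.compression.zstd.level", 1)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new ZstdOutputStream(s, level)

  override def compressedInputStream(s: InputStream): InputStream =
    new ZstdInputStream(s)
}
{code}

Since spark.io.compression.codec also accepts a fully qualified class name, a 
codec like this should be testable without touching the built-in short names.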



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19111) S3 Mesos history upload fails if too large or if distributed datastore is misbehaving

2017-01-06 Thread Charles Allen (JIRA)
Charles Allen created SPARK-19111:
-

 Summary: S3 Mesos history upload fails if too large or if 
distributed datastore is misbehaving
 Key: SPARK-19111
 URL: https://issues.apache.org/jira/browse/SPARK-19111
 Project: Spark
  Issue Type: Bug
  Components: EC2, Mesos, Spark Core
Affects Versions: 2.0.0
Reporter: Charles Allen


{code}
2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped Spark 
web UI at http://REDACTED:4041
2017-01-06T21:32:32,938 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.jvmGCTime
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.localBlocksFetched
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.resultSerializationTime
2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(
364,WrappedArray())
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.resultSize
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.peakExecutionMemory
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.fetchWaitTime
2017-01-06T21:32:32,939 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.memoryBytesSpilled
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.remoteBytesRead
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.diskBytesSpilled
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.localBytesRead
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.recordsRead
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.executorDeserializeTime
2017-01-06T21:32:32,940 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes
2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.executorRunTime
2017-01-06T21:32:32,941 INFO [SparkListenerBus] 
com.metamx.starfire.spark.SparkDriver - emitting metric: 
internal.metrics.shuffle.read.remoteBlocksFetched
2017-01-06T21:32:32,943 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' 
closed. Now beginning upload
2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray())
2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray())
2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] 
org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already 
stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray())
{code}

Running spark on mesos, some large jobs fail to upload to the history server 
storage!

A successful sequence of events in the log that yield an upload are as follows:

{code}
2017-01-06T19:14:32,925 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp'
2017-01-06T21:59:14,789 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' 
closed. Now beginning upload
2017-01-06T21:59:44,679 INFO [main] 
org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' upload 
complete
{code}

But large jobs do not ever get to the {{upload complete}} log message, and 
instead exit before completion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Closed] (SPARK-18710) Add offset to GeneralizedLinearRegression models

2017-01-06 Thread Wayne Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang closed SPARK-18710.
---
Resolution: Unresolved

> Add offset to GeneralizedLinearRegression models
> 
>
> Key: SPARK-18710
> URL: https://issues.apache.org/jira/browse/SPARK-18710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: features
>   Original Estimate: 10h
>  Remaining Estimate: 10h
>
> The current GeneralizedLinearRegression model does not support offset. The 
> offset can be useful for taking exposure into account, or for testing the 
> incremental effect of new variables. It is possible to use weights in the 
> current implementation to achieve the same effect as specifying an offset for 
> certain models (e.g., Poisson and Binomial with a log offset), but it is 
> desirable to have an offset option that works in more general cases, e.g., a 
> negative offset or an offset that is hard to specify using weights (e.g., an 
> offset on the probability rather than the odds in logistic regression).
> Effort would involve:
> * update regression class to support offsetCol
> * update IWLS to take the offset into account
> * add test case for offset
> I can start working on this if the community approves this feature. 
>  
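
For readers less familiar with GLM offsets (a notational aside): with link g and 
exposure t_i, the offset enters the linear predictor additively with a fixed 
coefficient of 1, which is what distinguishes it from an ordinary feature:

{code}
g(\mathbb{E}[y_i]) = o_i + x_i^\top \beta

% Poisson regression with log link and exposure t_i, i.e. o_i = \log t_i:
\log \mathbb{E}[y_i] = \log t_i + x_i^\top \beta
  \iff \mathbb{E}[y_i] = t_i \, \exp(x_i^\top \beta)
{code}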



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19110:


Assignee: Apache Spark

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Miao Wang
>Assignee: Apache Spark
>
> While adding DistributedLDAModel training summary for SparkR, I found that 
> the logPrior for original and loaded model is different.
> For example, in the test("read/write DistributedLDAModel"), I add the test:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
>   val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
>   assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805688#comment-15805688
 ] 

Apache Spark commented on SPARK-19110:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/16491

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Miao Wang
>
> While adding DistributedLDAModel training summary for SparkR, I found that 
> the logPrior for original and loaded model is different.
> For example, in the test("read/write DistributedLDAModel"), I add the test:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
>   val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
>   assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19110:


Assignee: (was: Apache Spark)

> DistributedLDAModel returns different logPrior for original and loaded model
> 
>
> Key: SPARK-19110
> URL: https://issues.apache.org/jira/browse/SPARK-19110
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Miao Wang
>
> While adding DistributedLDAModel training summary for SparkR, I found that 
> the logPrior for original and loaded model is different.
> For example, in the test("read/write DistributedLDAModel"), I add the test:
> val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
>   val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
>   assert(logPrior === logPrior2)
> The test fails:
> -4.394180878889078 did not equal -4.294290536919573



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model

2017-01-06 Thread Miao Wang (JIRA)
Miao Wang created SPARK-19110:
-

 Summary: DistributedLDAModel returns different logPrior for 
original and loaded model
 Key: SPARK-19110
 URL: https://issues.apache.org/jira/browse/SPARK-19110
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Reporter: Miao Wang


While adding the DistributedLDAModel training summary for SparkR, I found that the 
logPrior of the original and the loaded model is different.
For example, in the test("read/write DistributedLDAModel") I add:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails with:
-4.394180878889078 did not equal -4.294290536919573






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19109) ORC metadata section can sometimes exceed protobuf message size limit

2017-01-06 Thread Nic Eggert (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Eggert updated SPARK-19109:
---
Description: 
Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue:

If there are too many small stripes and many columns, the overhead of 
storing metadata (column stats) can exceed the default protobuf message size of 
64MB. Reading such files will throw the following exception:
{code}
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: 
Protocol message was too large.  May be malicious.  Use 
CodedInputStream.setSizeLimit() to increase the size limit.
at 
com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
at 
com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
at 
com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
at 
com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1331)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1281)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4887)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4803)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12925)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12872)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13599)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13546)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
at 
com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.(ReaderImpl.java:468)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:314)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{code}

This is fixed in Hive 1.3, so it should be fairly straightforward to pick up 
the patch.

As a side note: Spark's management of its Hive fork/dependency seems incredibly 
arcane to me. Surely there's a better way than publishing to central from 
developers' personal repos.

  was:
Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue:

If there are too many small stripes and with many columns, the overhead for 
storing metadata (column stats) can exceed the default protobuf message size of 
64MB. Reading such files will throw the following exception
{code}
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: 
Protocol message was too large.  May be malicious.  Use 
CodedInputStream.setSizeLimit() to increase the size limit.
at 

[jira] [Created] (SPARK-19109) ORC metadata section can sometimes exceed protobuf message size limit

2017-01-06 Thread Nic Eggert (JIRA)
Nic Eggert created SPARK-19109:
--

 Summary: ORC metadata section can sometimes exceed protobuf 
message size limit
 Key: SPARK-19109
 URL: https://issues.apache.org/jira/browse/SPARK-19109
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0
Reporter: Nic Eggert


Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue:

If there are too many small stripes and many columns, the overhead of 
storing metadata (column stats) can exceed the default protobuf message size of 
64MB. Reading such files will throw the following exception:
{code}
Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: 
Protocol message was too large.  May be malicious.  Use 
CodedInputStream.setSizeLimit() to increase the size limit.
at 
com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110)
at 
com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755)
at 
com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811)
at 
com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1331)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1281)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4887)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4803)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12925)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12872)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956)
at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13599)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13546)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630)
at 
com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223)
at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.(ReaderImpl.java:468)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:314)
at 
org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228)
at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
{code}

This is fixed in Hive 1.3, so it should be fairly straightforward to pick up 
the patch.
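
For reference, the workaround named in the exception text (not the actual Hive 1.3 
patch) amounts to raising the CodedInputStream limit before parsing the metadata 
section; a sketch, with metadataIn standing in for an input stream positioned at 
the ORC metadata section:

{code}
import com.google.protobuf.CodedInputStream
import org.apache.hadoop.hive.ql.io.orc.OrcProto

// metadataIn: java.io.InputStream over the ORC metadata section (assumed to be in hand)
val coded = CodedInputStream.newInstance(metadataIn)
coded.setSizeLimit(Int.MaxValue)  // the default limit is 64MB
val metadata = OrcProto.Metadata.parseFrom(coded)
{code}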

As a side note: Spark's management of its Hive fork/dependency seems incredibly 
arcane to me. Surely there's a better way than publishing to central from 
developers' personal repos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6099.
--
Resolution: Done

> Stabilize mllib ClassificationModel, RegressionModel APIs
> -
>
> Key: SPARK-6099
> URL: https://issues.apache.org/jira/browse/SPARK-6099
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The abstractions spark.mllib.classification.ClassificationModel and 
> spark.mllib.regression.RegressionModel have been Experimental for a while.  
> This is a problem since some of the implementing classes are not Experimental 
> (e.g., LogisticRegressionModel).
> We should finalize the API and make it non-Experimental ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6098) Propagate Experimental tag to child classes

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6098.
--
Resolution: Not A Problem

This is defunct now that these aren't even experimental

> Propagate Experimental tag to child classes
> ---
>
> Key: SPARK-6098
> URL: https://issues.apache.org/jira/browse/SPARK-6098
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Issue: An abstraction (e.g., mllib.classification.ClassificationModel) may be 
> Experimental even when its implementing classes (e.g., 
> mllib.classification.LogisticRegressionModel) are not.
> Proposal: That tag should be propagated to child classes (or better yet to 
> the relevant parts of the child classes).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-18890:
---
Issue Type: Improvement  (was: Bug)

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-06 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805495#comment-15805495
 ] 

Kay Ousterhout commented on SPARK-18890:


I just opened SPARK-19108 for the broadcast issue.  In the meantime, after 
thinking about this more (and also based on your comments on the associated PRs, 
Imran) I think we should go ahead and merge this change to consolidate the 
serialization in one place.  If nothing else, that change makes the code more 
readable, and I suspect it will make it easier to implement further optimizations 
to the serialization in the future.

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)

2017-01-06 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19108:
--

 Summary: Broadcast all shared parts of tasks (to reduce task 
serialization time)
 Key: SPARK-19108
 URL: https://issues.apache.org/jira/browse/SPARK-19108
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Reporter: Kay Ousterhout


Expand the amount of information that's broadcasted for tasks, to avoid 
serializing data per-task that should only be sent to each executor once for 
the entire stage.

Conceptually, this means we'd have new classes specifically for sending the 
minimal necessary data to the executor, like:

{code}
/**
  * metadata about the taskset needed by the executor for all tasks in this 
taskset.  Subset of the
  * full data kept on the driver to make it faster to serialize and send to 
executors.
  */
class ExecutorTaskSetMeta(
  val stageId: Int,
  val stageAttemptId: Int,
  val properties: Properties,
  val addedFiles: Map[String, String],
  val addedJars: Map[String, String]
  // maybe task metrics here?
)

class ExecutorTaskData(
  val partitionId: Int,
  val attemptNumber: Int,
  val taskId: Long,
  val taskBinary: Broadcast[Array[Byte]],
  val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
)
{code}

Then all the info you'd need to send to the executors would be a serialized 
version of ExecutorTaskData.  Furthermore, given the simplicity of that class, 
you could serialize it manually, and then for each task just overwrite the 
first two ints and the one long directly in the byte buffer.  (You could do the 
same serialization trick even if ExecutorTaskSetMeta were not a broadcast, but 
making it a broadcast keeps the messages small as well.)
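
A rough illustration of that byte-buffer patching idea (a sketch only; it assumes 
the serialized ExecutorTaskData template happens to begin with the two ints and 
the long, in that order):

{code}
import java.nio.ByteBuffer

// Patch the per-task fields into a copy of a pre-serialized template message.
def perTaskMessage(
    template: Array[Byte],
    partitionId: Int,
    attemptNumber: Int,
    taskId: Long): ByteBuffer = {
  val buf = ByteBuffer.wrap(template.clone())
  buf.putInt(0, partitionId)    // first int
  buf.putInt(4, attemptNumber)  // second int
  buf.putLong(8, taskId)        // one long
  buf
}
{code}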

There are a bunch of details I'm skipping here: you'd also need to do some 
special handling for the TaskMetrics; the way tasks get started in the executor 
would change; you'd also need to refactor {{Task}} to let it be reconstructed 
from this information (or add more to ExecutorTaskSetMeta); and probably other 
details I'm overlooking now.

(this is copied from SPARK-18890 and [~imranr]'s comment there; cc [~shivaram])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19074) Update Structured Streaming Programming guide for Update Mode

2017-01-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-19074.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16468
[https://github.com/apache/spark/pull/16468]

> Update Structured Streaming Programming guide for Update Mode
> -
>
> Key: SPARK-19074
> URL: https://issues.apache.org/jira/browse/SPARK-19074
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.1, 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3937) Unsafe memory access inside of Snappy library

2017-01-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout closed SPARK-3937.
-

> Unsafe memory access inside of Snappy library
> -
>
> Key: SPARK-3937
> URL: https://issues.apache.org/jira/browse/SPARK-3937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Patrick Wendell
>
> This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
> have much information about this other than the stack trace. However, it was 
> concerning enough I figured I should post it.
> {code}
> java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code
> org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
> org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
> 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
> 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
> 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Resolved] (SPARK-3937) Unsafe memory access inside of Snappy library

2017-01-06 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-3937.
---
Resolution: Won't Fix

Closing this due to lack of activity / reports of issues on recent versions of 
Spark

> Unsafe memory access inside of Snappy library
> -
>
> Key: SPARK-3937
> URL: https://issues.apache.org/jira/browse/SPARK-3937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Patrick Wendell
>
> This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't 
> have much information about this other than the stack trace. However, it was 
> concerning enough I figured I should post it.
> {code}
> java.lang.InternalError: a fault occurred in a recent unsafe memory access 
> operation in compiled Java code
> org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
> org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444)
> org.xerial.snappy.Snappy.uncompress(Snappy.java:480)
> 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355)
> 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159)
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
> 
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712)
> 
> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
> 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> 
> org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-06 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805366#comment-15805366
 ] 

Seth Hendrickson commented on SPARK-10078:
--

As part of [SPARK-17136|https://issues.apache.org/jira/browse/SPARK-17136], I am 
looking into a design for a generic optimizer interface for Spark.ML. Ideally, 
this should be abstracted such that, as Yanbo mentioned, users can switch 
between optimizers easily. I don't think adding this to Breeze is important, 
since we hope to add our own interface directly into Spark.
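
For illustration only, here is a minimal sketch of the kind of pluggable 
optimizer interface being discussed. All names are hypothetical; this is not an 
actual Spark or Breeze API, just the shape of an abstraction that would let an 
algorithm pick a local or a distributed implementation.

{code}
// Hypothetical sketch of a pluggable optimizer abstraction (not a real API).
trait Optimizer {
  // "gradient" returns (loss, gradient) at the given coefficient vector.
  def minimize(
      gradient: Array[Double] => (Double, Array[Double]),
      initial: Array[Double]): Array[Double]
}

object OptimizerSelector {
  // The threshold is purely illustrative; a real implementation would tune it.
  def select(numFeatures: Long, local: Optimizer, distributed: Optimizer): Optimizer =
    if (numFeatures < 10000000L) local else distributed
}
{code}

The point is only that algorithms code against the trait, and the choice 
between a Breeze-style L-BFGS and a vector-free L-BFGS happens behind it.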

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs

2017-01-06 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805232#comment-15805232
 ] 

Ilya Matiach commented on SPARK-6099:
-

It doesn't look like the APIs are experimental anymore; can this JIRA be 
closed?

> Stabilize mllib ClassificationModel, RegressionModel APIs
> -
>
> Key: SPARK-6099
> URL: https://issues.apache.org/jira/browse/SPARK-6099
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> The abstractions spark.mllib.classification.ClassificationModel and 
> spark.mllib.regression.RegressionModel have been Experimental for a while.  
> This is a problem since some of the implementing classes are not Experimental 
> (e.g., LogisticRegressionModel).
> We should finalize the API and make it non-Experimental ASAP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2017-01-06 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805205#comment-15805205
 ] 

Marcelo Vanzin commented on SPARK-5493:
---

> Since keytab will be owned by a "service" account, not by proxied users, and 
> keytab file will have proper OS permissions, not sure I'm following how 
> keytab would be exposed to those proxied users?

When you use the principal / keytab options in spark-submit, Spark uploads the 
keytab to HDFS, under the user running the application (in this case, the proxy 
user).


> Support proxy users under kerberos
> --
>
> Key: SPARK-5493
> URL: https://issues.apache.org/jira/browse/SPARK-5493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Brock Noland
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> When using kerberos, services may want to use spark-submit to submit jobs as 
> a separate user. For example a service like hive might want to submit jobs as 
> a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos

2017-01-06 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805198#comment-15805198
 ] 

Ruslan Dautkhanov edited comment on SPARK-5493 at 1/6/17 6:19 PM:
--

{quote}There might be ways to hack support for that without changes in Spark, 
but I'd like to see a proper API in Spark for distributing new delegation 
tokens. I mentioned that in SPARK-14743, but although that bug is closed, that 
particular feature hasn't been implemented yet.
{quote}

[~vanzin], would it be possible to submit a new jira for this part that didn't 
get implemented in SPARK-14743?
Thank you.

{quote}
There might be ways to hack support for that without changes in Spark
{quote}
Something like ssh'ing regularly into all Hadoop nodes under the proxied user 
id and running kinit?
Yep, a proper API would be better here.

{quote}
It would expose the keytab to the proxied user, which in 99% of the cases is 
not wanted.
{quote}
Since the keytab will be owned by a "service" account, not by the proxied 
users, and the keytab file will have proper OS permissions, I'm not sure I 
follow how the keytab would be exposed to those proxied users. Could you 
please elaborate? Proxy authentication is only for Hadoop services; the keytab 
is just a file, and we could rely on regular OS permissions to lock down its 
access. I'm probably missing something here.


was (Author: tagar):
{quote}There might be ways to hack support for that without changes in Spark, 
but I'd like to see a proper API in Spark for distributing new delegation 
tokens. I mentioned that in SPARK-14743, but although that bug is closed, that 
particular feature hasn't been implemented yet.
{quote}

[~vanzin], would it be possible to submit a new jira for this part that didn't 
get implemented in SPARK-14743?
Thank you.

{quote}
There might be ways to hack support for that without changes in Spark
{quote}
Something like ssh'ing regularly into all Hadoop nodes under proxied user id 
and running kinit?
Yep, a proper API would be better here.

{quote}
It would expose the keytab to the proxied user, which in 99% of the cases is 
not wanted.
{quote}
Since keytab will be owned by a "service" account, not by proxied users, and 
keytab file will have proper 
OS permissions, not sure I'm following how keytab would be exposed to those 
proxied users? Could you please elaborate. Proxy authentication is only for 
Hadoop services. keytab is just a file and we could rely on OS permissions to 
lock its access, relying on regular OS permissions. I'm probably missing 
something here.

> Support proxy users under kerberos
> --
>
> Key: SPARK-5493
> URL: https://issues.apache.org/jira/browse/SPARK-5493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Brock Noland
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> When using kerberos, services may want to use spark-submit to submit jobs as 
> a separate user. For example a service like hive might want to submit jobs as 
> a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5493) Support proxy users under kerberos

2017-01-06 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805198#comment-15805198
 ] 

Ruslan Dautkhanov commented on SPARK-5493:
--

{quote}There might be ways to hack support for that without changes in Spark, 
but I'd like to see a proper API in Spark for distributing new delegation 
tokens. I mentioned that in SPARK-14743, but although that bug is closed, that 
particular feature hasn't been implemented yet.
{quote}

[~vanzin], would it be possible to submit a new jira for this part that didn't 
get implemented in SPARK-14743?
Thank you.

{quote}
There might be ways to hack support for that without changes in Spark
{quote}
Something like ssh'ing regularly into all Hadoop nodes under proxied user id 
and running kinit?
Yep, a proper API would be better here.

{quote}
It would expose the keytab to the proxied user, which in 99% of the cases is 
not wanted.
{quote}
Since the keytab will be owned by a "service" account, not by the proxied 
users, and the keytab file will have proper OS permissions, I'm not sure I 
follow how the keytab would be exposed to those proxied users. Could you 
please elaborate? Proxy authentication is only for Hadoop services; the keytab 
is just a file, and we could rely on regular OS permissions to lock down its 
access. I'm probably missing something here.

> Support proxy users under kerberos
> --
>
> Key: SPARK-5493
> URL: https://issues.apache.org/jira/browse/SPARK-5493
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Brock Noland
>Assignee: Marcelo Vanzin
> Fix For: 1.3.0
>
>
> When using kerberos, services may want to use spark-submit to submit jobs as 
> a separate user. For example a service like hive might want to submit jobs as 
> a client user.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-01-06 Thread Ilya Matiach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805169#comment-15805169
 ] 

Ilya Matiach commented on SPARK-11968:
--

Can someone with permissions change the status from In Progress to Open? The 
pull request that was sent has been closed, and the issue still exists.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19083) sbin/start-history-server.sh scripts use of $@ without ""

2017-01-06 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19083.

   Resolution: Fixed
 Assignee: zuotingbing
Fix Version/s: 2.2.0
   2.1.1

> sbin/start-history-server.sh scripts use of $@ without ""
> -
>
> Key: SPARK-19083
> URL: https://issues.apache.org/jira/browse/SPARK-19083
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: linux
>Reporter: zuotingbing
>Assignee: zuotingbing
>Priority: Trivial
> Fix For: 2.1.1, 2.2.0
>
>
> The sbin/start-history-server.sh script uses $@ without quotes, which affects 
> the length of the args used in HistoryServerArguments::parse(args: 
> List[String]).
> It should be written as follows:
> exec ... org.apache.spark.deploy.history.HistoryServer 1 "$@"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-01-06 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805087#comment-15805087
 ] 

Imran Rashid commented on SPARK-18890:
--

[~gq] I think you misunderstood my suggestion about using a broadcast for the 
task. I'm not suggesting using a broadcast to contain *all* the task 
information, only the information which is shared across all tasks in a 
taskset. E.g., the preferred location is ignored on the executor, so we 
wouldn't bother serializing it at all. Conceptually, this means we'd have new 
classes specifically for sending the minimal necessary data to the executor, like:

{code}
/**
  * metadata about the taskset needed by the executor for all tasks in this 
taskset.  Subset of the
  * full data kept on the driver to make it faster to serialize and send to 
executors.
  */
class ExecutorTaskSetMeta(
  val stageId: Int,
  val stageAttemptId: Int,
  val properties: Properties,
  val addedFiles: Map[String, String],
  val addedJars: Map[String, String]
  // maybe task metrics here?
)

class ExecutorTaskData(
  val partitionId: Int,
  val attemptNumber: Int,
  val taskId: Long,
  val taskBinary: Broadcast[Array[Byte]],
  val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
)
{code}

Then all the info you'd need to send to the executors would be a serialized 
version of ExecutorTaskData. Furthermore, given the simplicity of that class, 
you could serialize it manually, and then for each task you could just modify 
the first two ints and one long directly in the byte buffer. (You could do the 
same trick for serialization even if ExecutorTaskSetMeta was not a broadcast, 
but using a broadcast keeps the messages small as well.)
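
As a rough sketch of that patching trick (purely illustrative; none of these 
names come from Spark), you could keep one serialized template per taskset 
with a fixed-size per-task header up front and rewrite only the header for 
each task:

{code}
import java.nio.ByteBuffer

// Illustrative only: header = [Int partitionId][Int attemptNumber][Long taskId],
// followed by the payload shared by every task in the taskset.
object TaskBytesPatcher {
  val HeaderSize: Int = 4 + 4 + 8

  // Built once per taskset: reserve room for the header, then append the shared payload.
  def buildTemplate(sharedPayload: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(HeaderSize + sharedPayload.length)
    buf.position(HeaderSize)
    buf.put(sharedPayload)
    buf.array()
  }

  // Per task: copy the template and overwrite just the two ints and the long.
  def patch(template: Array[Byte], partitionId: Int, attempt: Int, taskId: Long): Array[Byte] = {
    val bytes = template.clone()
    val buf = ByteBuffer.wrap(bytes)
    buf.putInt(partitionId)
    buf.putInt(attempt)
    buf.putLong(taskId)
    bytes
  }
}
{code}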

There are a bunch of details I'm skipping here: you'd also need to do some 
special handling for the TaskMetrics; the way tasks get started in the executor 
would change; you'd also need to refactor {{Task}} to let it get reconstructed 
from this information (or add more to ExecutorTaskSetMeta); and probably other 
details I'm overlooking now.

But if we really see task serialization as an issue, this seems like the right 
approach.

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19106) Styling for the configuration docs is broken

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19106:


Assignee: Apache Spark

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Assignee: Apache Spark
>Priority: Trivial
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18836) Serialize Task Metrics once per stage

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805042#comment-15805042
 ] 

Apache Spark commented on SPARK-18836:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/16489

> Serialize Task Metrics once per stage
> -
>
> Key: SPARK-18836
> URL: https://issues.apache.org/jira/browse/SPARK-18836
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 2.2.0
>
>
> Right now we serialize the empty task metrics once per task. Since this is 
> shared across all tasks, we could use the same serialized task metrics across 
> all tasks of a stage.
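
For illustration only (names hypothetical, not the actual patch), the idea is 
simply to pay the serialization cost once per stage and hand every task the 
same bytes:

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative sketch: serialize a per-stage template once and reuse the same
// byte array for every task, instead of re-serializing it per task.
case class StageTemplate(stageId: Int, counters: Map[String, Long]) extends Serializable

object StageTemplate {
  def serializeOnce(template: StageTemplate): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(template) // paid once per stage
    out.close()
    bytes.toByteArray
  }
}
{code}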



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19106) Styling for the configuration docs is broken

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805043#comment-15805043
 ] 

Apache Spark commented on SPARK-19106:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16490

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Trivial
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19106) Styling for the configuration docs is broken

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19106:


Assignee: (was: Apache Spark)

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Trivial
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19106) Styling for the configuration docs is broken

2017-01-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805037#comment-15805037
 ] 

Sean Owen commented on SPARK-19106:
---

Yeah, the section headings aren't rendering as section titles. Not a big deal 
but should be fixed. PR coming.

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Trivial
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17931) taskScheduler has some unneeded serialization

2017-01-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-17931:
-
Assignee: Kay Ousterhout

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
>Assignee: Kay Ousterhout
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17931) taskScheduler has some unneeded serialization

2017-01-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-17931.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16053
[https://github.com/apache/spark/pull/16053]

> taskScheduler has some unneeded serialization
> -
>
> Key: SPARK-17931
> URL: https://issues.apache.org/jira/browse/SPARK-17931
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Guoqiang Li
> Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of deserialization is
> unnecessary (this is as a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.
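
A hedged sketch of the direction described above (field names are illustrative, 
not the exact shape of the merged class): the JAR, file, and Properties info 
rides inside the task description itself, so only the inner task bytes need a 
separate serialization pass.

{code}
import java.nio.ByteBuffer
import java.util.Properties

// Illustrative only: a task description that carries its JARs, files and
// properties directly, removing the third serialization layer.
class TaskDescriptionSketch(
    val taskId: Long,
    val attemptNumber: Int,
    val executorId: String,
    val addedFiles: Map[String, Long],  // file name -> timestamp
    val addedJars: Map[String, Long],   // jar name  -> timestamp
    val properties: Properties,
    val serializedTask: ByteBuffer)
{code}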



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS

2017-01-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804899#comment-15804899
 ] 

Yanbo Liang edited comment on SPARK-10078 at 1/6/17 4:27 PM:
-

[~debasish83] We aim to implement VL-BFGS as an optimizer for ~billion-feature 
problems, as a peer of Breeze LBFGS/OWLQN, so that ML algorithms can switch 
between them automatically based on the number of features. An abstract 
interface between the algorithms and the optimizers is therefore absolutely 
necessary. 
For VL-BFGS, I have a basic implementation at 
https://github.com/yanboliang/spark-vlbfgs; please feel free to review and 
comment on the code. Thanks.


was (Author: yanboliang):
[~debasish83] We are aim to implement VL-BFGS as an optimizer which should be 
similar with Breeze LBFGS/OWLQN, and switching between them should be 
automatically based on the number of features. So an abstract interface between 
the algorithm and optimizer is really necessary. I have a basic implementation 
at https://github.com/yanboliang/spark-vlbfgs, please feel free to review and 
comment the code. Thanks.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804899#comment-15804899
 ] 

Yanbo Liang commented on SPARK-10078:
-

[~debasish83] We aim to implement VL-BFGS as an optimizer similar to Breeze 
LBFGS/OWLQN, with switching between them handled automatically based on the 
number of features. So an abstract interface between the algorithm and the 
optimizer is really necessary. I have a basic implementation at 
https://github.com/yanboliang/spark-vlbfgs; please feel free to review and 
comment on the code. Thanks.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19033) HistoryServer still uses old ACLs even if ACLs are updated

2017-01-06 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-19033.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

> HistoryServer still uses old ACLs even if ACLs are updated
> --
>
> Key: SPARK-19033
> URL: https://issues.apache.org/jira/browse/SPARK-19033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> In the current implementation of HistoryServer, the application ACLs are 
> picked from the event log rather than from the configuration:
> {code}
> val uiAclsEnabled = 
> conf.getBoolean("spark.history.ui.acls.enable", false)
> ui.getSecurityManager.setAcls(uiAclsEnabled)
> // make sure to set admin acls before view acls so they are 
> properly picked up
> 
> ui.getSecurityManager.setAdminAcls(appListener.adminAcls.getOrElse(""))
> ui.getSecurityManager.setViewAcls(attempt.sparkUser,
>   appListener.viewAcls.getOrElse(""))
> 
> ui.getSecurityManager.setAdminAclsGroups(appListener.adminAclsGroups.getOrElse(""))
> 
> ui.getSecurityManager.setViewAclsGroups(appListener.viewAclsGroups.getOrElse(""))
> {code}
> This becomes a problem when the ACLs are updated (e.g., a newly added admin): 
> only new applications are affected, while old applications still use the 
> old ACLs, so the new admins still cannot check the logs of old applications.
> It is hard to say this is a bug, but in our scenario this is not the 
> behavior we wanted.
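
For illustration only (this is not the patch that was merged), the gist is to 
also honor ACLs from the live history server configuration rather than only 
the values frozen into the event log, e.g. by merging the two lists before 
handing them to the security manager:

{code}
// Illustrative helper: union of comma-separated ACL lists from the current
// configuration and from the event log, so admins added later keep access.
object AclMerger {
  def mergeAcls(fromConf: String, fromEventLog: String): String =
    (fromConf.split(",") ++ fromEventLog.split(","))
      .map(_.trim)
      .filter(_.nonEmpty)
      .distinct
      .mkString(",")
}
{code}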



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19033) HistoryServer still uses old ACLs even if ACLs are updated

2017-01-06 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-19033:
--
Assignee: Saisai Shao

> HistoryServer still uses old ACLs even if ACLs are updated
> --
>
> Key: SPARK-19033
> URL: https://issues.apache.org/jira/browse/SPARK-19033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
>
> In the current implementation of HistoryServer, the application ACLs are 
> picked from the event log rather than from the configuration:
> {code}
> val uiAclsEnabled = 
> conf.getBoolean("spark.history.ui.acls.enable", false)
> ui.getSecurityManager.setAcls(uiAclsEnabled)
> // make sure to set admin acls before view acls so they are 
> properly picked up
> 
> ui.getSecurityManager.setAdminAcls(appListener.adminAcls.getOrElse(""))
> ui.getSecurityManager.setViewAcls(attempt.sparkUser,
>   appListener.viewAcls.getOrElse(""))
> 
> ui.getSecurityManager.setAdminAclsGroups(appListener.adminAclsGroups.getOrElse(""))
> 
> ui.getSecurityManager.setViewAclsGroups(appListener.viewAclsGroups.getOrElse(""))
> {code}
> This becomes a problem when the ACLs are updated (e.g., a newly added admin): 
> only new applications are affected, while old applications still use the 
> old ACLs, so the new admins still cannot check the logs of old applications.
> It is hard to say this is a bug, but in our scenario this is not the 
> behavior we wanted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10078) Vector-free L-BFGS

2017-01-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804882#comment-15804882
 ] 

Yanbo Liang commented on SPARK-10078:
-

[~sethah] The description is a little misleading; it means the VL-BFGS 
implementation can fit the current API. Feature partitioning (VL-BFGS) or not 
(Breeze LBFGS) will be chosen automatically depending on the number of features. 
The purpose of VL-BFGS is not to replace Breeze LBFGS but to complement it. 
Thanks.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19107:


Assignee: Apache Spark  (was: Wenchen Fan)

> support creating hive table with DataFrameWriter and Catalog
> 
>
> Key: SPARK-19107
> URL: https://issues.apache.org/jira/browse/SPARK-19107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19107:


Assignee: Wenchen Fan  (was: Apache Spark)

> support creating hive table with DataFrameWriter and Catalog
> 
>
> Key: SPARK-19107
> URL: https://issues.apache.org/jira/browse/SPARK-19107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804809#comment-15804809
 ] 

Apache Spark commented on SPARK-19107:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16487

> support creating hive table with DataFrameWriter and Catalog
> 
>
> Key: SPARK-19107
> URL: https://issues.apache.org/jira/browse/SPARK-19107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog

2017-01-06 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19107:
---

 Summary: support creating hive table with DataFrameWriter and 
Catalog
 Key: SPARK-19107
 URL: https://issues.apache.org/jira/browse/SPARK-19107
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19106) Styling for the configuration docs is broken

2017-01-06 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-19106:


 Summary: Styling for the configuration docs is broken
 Key: SPARK-19106
 URL: https://issues.apache.org/jira/browse/SPARK-19106
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Nicholas Chammas
Priority: Trivial
 Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png

There are several styling problems with the configuration docs, starting 
roughly from the Scheduling section on down.

http://spark.apache.org/docs/latest/configuration.html





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19106) Styling for the configuration docs is broken

2017-01-06 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-19106:
-
Attachment: Screen Shot 2017-01-06 at 10.20.52 AM.png

> Styling for the configuration docs is broken
> 
>
> Key: SPARK-19106
> URL: https://issues.apache.org/jira/browse/SPARK-19106
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nicholas Chammas
>Priority: Trivial
> Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png
>
>
> There are several styling problems with the configuration docs, starting 
> roughly from the Scheduling section on down.
> http://spark.apache.org/docs/latest/configuration.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19086) Improper scoping of name resolution of columns in HAVING clause

2017-01-06 Thread Nattavut Sutyanyong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nattavut Sutyanyong closed SPARK-19086.
---
Resolution: Not A Problem

> Improper scoping of name resolution of columns in HAVING clause
> ---
>
> Key: SPARK-19086
> URL: https://issues.apache.org/jira/browse/SPARK-19086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nattavut Sutyanyong
>Priority: Minor
>
> There seems to be a problem with the scoping of name resolution of columns in 
> a HAVING clause.
> Here is a scenario of the problem:
> {code}
> // A simplified version of TC 01.13 from PR-16337
> Seq((1,1,1)).toDF("t1a", "t1b", "t1c").createOrReplaceTempView("t1")
> Seq((1,1,1)).toDF("t2a", "t2b", "t2c").createOrReplaceTempView("t2")
> // This is okay. 
> // Error: t2c is unresolved
> sql("select t2a from t2 group by t2a having t2c = 8").show
> // This is okay as t2c is resolved to the t2 on the parent side
> // because t2 in the subquery does not output column t2c.
> sql("select * from t2 where t2a in (select t2a from (select t2a from t2) t2 
> group by t2a having t2c = 8)").explain(true)
> // This is the problem.
> sql("select * from t2 where t2a in (select t2a from t2 group by t2a having 
> t2c = 8)").explain(true)
> == Analyzed Logical Plan ==
> t2a: int, t2b: int, t2c: int
> Project [t2a#22, t2b#23, t2c#24]
> +- Filter predicate-subquery#38 [(t2a#22 = t2a#22#49) && (t2c#24 = 8)]
>:  +- Project [t2a#22 AS t2a#22#49]
>: +- Aggregate [t2a#22], [t2a#22]
>:+- SubqueryAlias t2, `t2`
>:   +- Project [_1#18 AS t2a#22, _2#19 AS t2b#23, _3#20 AS t2c#24]
>:  +- LocalRelation [_1#18, _2#19, _3#20]
>+- SubqueryAlias t2, `t2`
>   +- Project [_1#18 AS t2a#22, _2#19 AS t2b#23, _3#20 AS t2c#24]
>  +- LocalRelation [_1#18, _2#19, _3#20]
> {code}
> We should not resolve {{t2c}} in the subquery to the outer {{t2}} on the 
> parent side. It should try to resolve {{t2c}} to the {{t2}} in the subquery 
> from its current scope and raise an exception because it is invalid to pull 
> up the column {{t2c}} from the {{Aggregate}} operator below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19099) Wrong time display on Spark History Server web UI

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19099:
--
Target Version/s:   (was: 2.1.0)
  Labels:   (was: none)
   Fix Version/s: (was: 2.1.1)

> Wrong time display on Spark History Server web UI
> -
>
> Key: SPARK-19099
> URL: https://issues.apache.org/jira/browse/SPARK-19099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.1.0
>Reporter: JohnsonZhang
>Priority: Trivial
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> While using the Spark history server, I got wrong job start and end times. I 
> tracked down the reason and found it's because of the hard-coded TimeZone 
> rawOffset.
> I've changed it to acquire the offset value from the system instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Peter Parente (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804545#comment-15804545
 ] 

Peter Parente commented on SPARK-19105:
---

From chat in the PR, the keytab on the AM does have the proper UUID suffix. The 
HDFS staging area filename is a red herring and a misunderstanding on my part 
about where the files are first distributed.

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente closed SPARK-19105.
-
Resolution: Invalid

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19105:


Assignee: (was: Apache Spark)

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente updated SPARK-19105:
--
Comment: was deleted

(was: Related PR https://github.com/apache/spark/pull/16482)

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804475#comment-15804475
 ] 

Apache Spark commented on SPARK-19105:
--

User 'parente' has created a pull request for this issue:
https://github.com/apache/spark/pull/16482

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19105:


Assignee: Apache Spark

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>Assignee: Apache Spark
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Peter Parente (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804476#comment-15804476
 ] 

Peter Parente commented on SPARK-19105:
---

Related PR https://github.com/apache/spark/pull/16482

> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.
> The problem looks to be in one call to [copyFileToRemote in 
> yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
>  that leaves off the destination filename param. The other calls in that 
> object which use copyFileToRemote and have a custom destination name all 
> provide this parameter (e.g., 
> https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Peter Parente (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Parente updated SPARK-19105:
--
Description: 
When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
see the following in my app staging directory on HDFS:

{code}
-rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
-rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
-rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
__spark_libs__4440821503780683972.zip
-rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
-rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
{code}

I also see that my spark.yarn.keytab config value has changed to 
user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
the keytab is unique within the app staging directory. However, from the 
directory listing above, it's clear that the file written does not match this 
new name. As a result, when it comes time to renew the Kerberos ticket, 
AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed name 
and also fails to renew the tickets.

The problem looks to be in one call to [copyFileToRemote in 
yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
 that leaves off the destination filename param. The other calls in that object 
which use copyFileToRemote and have a custom destination name all provide this 
parameter (e.g., 
https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).

  was:
When I specify {{monospaced}}--principal user@REALM{{monospaced}} and 
{{monospaced}}--keytab /some/path/user.keytab{{monospaced}}, I see the 
following in my app staging directory on HDFS:

{code}
-rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
-rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
-rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
__spark_libs__4440821503780683972.zip
-rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
-rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
{code}

I also see that my spark.yarn.keytab config value has changed to 
user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
the keytab is unique within the app staging directory. However, from the 
directory listing above, it's clear that the file written does not match this 
new name. As a result, when it comes time to renew the Kerberos ticket, 
AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed name 
and also fails to renew the tickets.

The problem looks to be in one call to [copyFileToRemote in 
yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
 that leaves off the destination filename param. The other calls in that object 
which use copyFileToRemote and have a custom destination name all provide this 
parameter (e.g., 
https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).


> yarn/Client.scala copyToRemote does not include keytab destination name
> ---
>
> Key: SPARK-19105
> URL: https://issues.apache.org/jira/browse/SPARK-19105
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
> Environment: YARN in client mode
>Reporter: Peter Parente
>
> When I specify --principal user@REALM and --keytab /some/path/user.keytab, I 
> see the following in my app staging directory on HDFS:
> {code}
> -rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
> -rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
> -rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
> __spark_libs__4440821503780683972.zip
> -rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
> -rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
> {code}
> I also see that my spark.yarn.keytab config value has changed to 
> user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
> the keytab is unique within the app staging directory. However, from the 
> directory listing above, it's clear that the file written does not match this 
> new name. As a result, when it comes time to renew the Kerberos ticket, 
> AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed 
> name and also fails to renew the tickets.

[jira] [Created] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name

2017-01-06 Thread Peter Parente (JIRA)
Peter Parente created SPARK-19105:
-

 Summary: yarn/Client.scala copyToRemote does not include keytab 
destination name
 Key: SPARK-19105
 URL: https://issues.apache.org/jira/browse/SPARK-19105
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.1.0
 Environment: YARN in client mode
Reporter: Peter Parente


When I specify {{monospaced}}--principal user@REALM{{monospaced}} and 
{{monospaced}}--keytab /some/path/user.keytab{{monospaced}}, I see the 
following in my app staging directory on HDFS:

{code}
-rw-r--r--   3 user supergroup 68 2017-01-06 03:59 user.keytab
-rw-r--r--   3 user supergroup  73502 2017-01-06 03:59 __spark_conf__.zip
-rw-r--r--   3 user supergroup  189767340 2017-01-06 03:59 
__spark_libs__4440821503780683972.zip
-rw-r--r--   3 user supergroup  91275 2017-01-06 03:59 py4j-0.10.3-src.zip
-rw-r--r--   3 user supergroup 440385 2017-01-06 03:59 pyspark.zip
{code}

I also see that my spark.yarn.keytab config value has changed to 
user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure 
the keytab is unique within the app staging directory. However, from the 
directory listing above, it's clear that the file written does not match this 
new name. As a result, when it comes time to renew the Kerberos ticket, 
AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed name 
and also fails to renew the tickets.

The problem looks to be in one call to [copyFileToRemote in 
yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482]
 that leaves off the destination filename param. The other calls in that object 
which use copyFileToRemote and have a custom destination name all provide this 
parameter (e.g., 
https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).
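
For illustration only, here is a minimal standalone sketch of the behaviour the
report asks for: uploading the keytab under the same suffixed destination name
that the configuration will later reference. It deliberately does not reuse
Spark's internal copyFileToRemote signature; all paths and names below are
assumptions.

{code}
// Editorial sketch (not Spark's yarn/Client code): copy a keytab into a
// staging directory under an explicit, UUID-suffixed destination name, so the
// uploaded file name matches the name recorded in configuration.
import java.util.UUID

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object KeytabUploadSketch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    val localKeytab = new Path("file:///some/path/user.keytab")    // assumption
    val stagingDir  = new Path("/user/someuser/.sparkStaging/app") // assumption

    // Build the suffixed name first, then copy to exactly that destination,
    // instead of letting the copy default to the source file name.
    val destName = s"user.keytab-${UUID.randomUUID()}"
    val dest     = new Path(stagingDir, destName)
    fs.copyFromLocalFile(localKeytab, dest)

    // destName is what a config such as spark.yarn.keytab would have to point
    // at for later ticket renewal to find the uploaded file.
    println(s"Uploaded keytab as: $dest")
  }
}
{code}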



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9215) Implement WAL-free Kinesis receiver that give at-least once guarantee

2017-01-06 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804425#comment-15804425
 ] 

Gaurav Shah commented on SPARK-9215:


[~tdas] I know this is an old pull request, but I was still wondering if you can 
help. Can we enhance this to make sure that we checkpoint only after the blocks 
of data have been written? That would require implementing Spark checkpointing 
in the first place. Each block has a start and end sequence number.

> Implement WAL-free Kinesis receiver that give at-least once guarantee
> -
>
> Key: SPARK-9215
> URL: https://issues.apache.org/jira/browse/SPARK-9215
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.5.0
>
>
> Currently, the KinesisReceiver can lose some data in the case of certain 
> failures (receiver and driver failures). Using the write ahead logs can 
> mitigate some of the problem, but it is not ideal because WALs don't work with 
> S3 (eventual consistency, etc.), which is the most likely file system to be 
> used in the EC2 environment. Hence, we have to take a different approach to 
> improving reliability for Kinesis.
> Detailed design doc - 
> https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-01-06 Thread Nils Grabbert (JIRA)
Nils Grabbert created SPARK-19104:
-

 Summary:  CompileException with Map and Case Class in Spark 2.1.0
 Key: SPARK-19104
 URL: https://issues.apache.org/jira/browse/SPARK-19104
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Nils Grabbert


The following code will run with Spark 2.0.2 but not with Spark 2.1.0:

{code}
case class InnerData(name: String, value: Int)
case class Data(id: Int, param: Map[String, InnerData])

val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
val ds   = spark.createDataset(data)
{code}

Exception:
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 63, Column 46: Expression 
"ExternalMapToCatalyst_value_isNull1" is not an rvalue 
  at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
  at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
 
  at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
  at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984) 
  at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
  at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
  at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
  at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
  at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
  at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
  at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
  at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
  at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
 
  at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
 
  at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
  at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
  at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) 
  at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
  at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
 
  at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
 
  at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
  at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
  at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
  at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
 
  at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
 
  at 
org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
 
  at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
  at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
  at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
 
  at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
 
  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
  at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) 
  at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91) 
  at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:935)
 
  ... 77 more 
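
As an editorial aside, here is a self-contained variant of the reproduction
above. It assumes a local Spark 2.x environment; on 2.1.0 the createDataset
call is where the reported CompileException surfaces, while 2.0.2 runs it fine.

{code}
// Self-contained sketch of the reproduction above (editorial addition).
import org.apache.spark.sql.SparkSession

object MapCaseClassRepro {
  case class InnerData(name: String, value: Int)
  case class Data(id: Int, param: Map[String, InnerData])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-19104 repro sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))

    // Reported to work on 2.0.2 and to fail during code generation on 2.1.0.
    val ds = spark.createDataset(data)
    ds.show()

    spark.stop()
  }
}
{code}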




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-18997) Recommended upgrade libthrift to 0.9.3

2017-01-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804135#comment-15804135
 ] 

Sean Owen commented on SPARK-18997:
---

Help with what, opening a PR? http://spark.apache.org/contributing.html
You can look at the libthrift changes from the project itself to assess what 
changed. IIRC it was a lot.

> Recommended upgrade libthrift  to 0.9.3
> ---
>
> Key: SPARK-18997
> URL: https://issues.apache.org/jira/browse/SPARK-18997
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: meiyoula
>Priority: Minor
>
> libthrift 0.9.2 has a serious security vulnerability:CVE-2015-3254
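
As an editorial aside, a downstream build that only needs the patched client
jar before Spark itself upgrades can force the version with a dependency
override. A sketch for an sbt build, assuming the usual Maven coordinates for
libthrift; compatibility with the rest of the dependency tree still needs to be
verified:

{code}
// build.sbt fragment (editorial sketch): pin libthrift to the fixed release.
dependencyOverrides += "org.apache.thrift" % "libthrift" % "0.9.3"
{code}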



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19097) virtualenv example failed with conda due to ImportError: No module named ruamel.yaml.comments

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19097.
---
Resolution: Duplicate

I don't see value in opening a bunch of JIRAs on the same theme. These look 
like near duplicates and depend on this functionality being supported in the 
first place, which isn't apparently supported according to the parent.

> virtualenv example failed with conda due to ImportError: No module named 
> ruamel.yaml.comments
> -
>
> Key: SPARK-19097
> URL: https://issues.apache.org/jira/browse/SPARK-19097
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Yesha Vora
>
> Spark version : 2
> Steps:
> * install conda on all nodes (python2.7) ( pip install conda )
> * create requirement1.txt with "numpy > requirement1.txt "
> * Run kmeans.py application in yarn-client mode. 
> {code}
> spark-submit --master yarn --deploy-mode client --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.type=conda" --conf 
> "spark.pyspark.virtualenv.requirements=/tmp/requirements1.txt" --conf 
> "spark.pyspark.virtualenv.bin.path=/usr/bin/conda" --jars 
> /usr/hadoop-client/lib/hadoop-lzo.jar kmeans.py /tmp/in/kmeans_data.txt 
> 3{code}
> {code:title=app log}
> 17/01/06 01:39:25 DEBUG PythonWorkerFactory: user.home=/home/yarn
> 17/01/06 01:39:25 DEBUG PythonWorkerFactory: Running command:/usr/bin/conda 
> create --prefix 
> /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0017/container_1483592608863_0017_01_03/virtualenv_application_1483592608863_0017_0
>  --file requirements1.txt -y
> Traceback (most recent call last):
>   File "/usr/bin/conda", line 11, in 
> load_entry_point('conda==4.2.7', 'console_scripts', 'conda')()
>   File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 
> 561, in load_entry_point
> return get_distribution(dist).load_entry_point(group, name)
>   File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 
> 2631, in load_entry_point
> return ep.load()
>   File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 
> 2291, in load
> return self.resolve()
>   File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line 
> 2297, in resolve
> module = __import__(self.module_name, fromlist=['__name__'], level=0)
>   File "/usr/lib/python2.7/site-packages/conda/cli/__init__.py", line 8, in 
> <module>
> from .main import main  # NOQA
>   File "/usr/lib/python2.7/site-packages/conda/cli/main.py", line 46, in 
> <module>
> from ..base.context import context
>   File "/usr/lib/python2.7/site-packages/conda/base/context.py", line 18, in 
> <module>
> from ..common.configuration import (Configuration, MapParameter, 
> PrimitiveParameter,
>   File "/usr/lib/python2.7/site-packages/conda/common/configuration.py", line 
> 40, in <module>
> from ruamel.yaml.comments import CommentedSeq, CommentedMap  # pragma: no 
> cover
> ImportError: No module named ruamel.yaml.comments
> 17/01/06 01:39:26 WARN BlockManager: Putting block rdd_3_0 failed due to an 
> exception
> 17/01/06 01:39:26 WARN BlockManager: Block rdd_3_0 could not be removed as it 
> was not found on disk or in memory
> 17/01/06 01:39:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.RuntimeException: Fail to run command: /usr/bin/conda create 
> --prefix 
> /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0017/container_1483592608863_0017_01_03/virtualenv_application_1483592608863_0017_0
>  --file requirements1.txt -y
> at 
> org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:124)
> at 
> org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:70)
> at 
> org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117)
> at 
> org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117)
> at 
> scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
> at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results

2017-01-06 Thread XiaodongCui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaodongCui updated SPARK-19102:

Description: 
The problem is that cube6's second column (sumprice) is 10000 times bigger than 
cube5's second column (sumprice), but they should be equal. The bug only 
reappears with queries of the form sum(a * b), count(distinct c).

code:

DataFrame 
df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
df1.registerTempTable("hd_salesflat");
DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM hd_salesflat 
GROUP BY areacode1");
cube5.show(50);
cube6.show(50);

my  data:
transno | quantity | unitprice | areacode1
76317828|  1.0000  |  25.0000  |  HDCN

data schema:
 |-- areacode1: string (nullable = true)
 |-- quantity: decimal(20,4) (nullable = true)
 |-- unitprice: decimal(20,4) (nullable = true)
 |-- transno: string (nullable = true)

  was:
The problem is that cube6's second column (sumprice) is 10000 times bigger than 
cube5's second column (sumprice), but they should be equal. The bug only 
reappears with queries of the form sum(a * b), count(distinct c).

DataFrame 
df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
df1.registerTempTable("hd_salesflat");
DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM hd_salesflat 
GROUP BY areacode1");
cube5.show(50);
cube6.show(50);

my  data:
transno | quantity | unitprice | areacode1
76317828|  1.0000  |  25.0000  |  HDCN

data schema:
 |-- areacode1: string (nullable = true)
 |-- quantity: decimal(20,4) (nullable = true)
 |-- unitprice: decimal(20,4) (nullable = true)
 |-- transno: string (nullable = true)


> Accuracy error of spark SQL results
> ---
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6
>Reporter: XiaodongCui
> Attachments: a.zip
>
>
> The problem is that cube6's second column (sumprice) is 10000 times bigger 
> than cube5's second column (sumprice), but they should be equal. The bug only 
> reappears with queries of the form sum(a * b), count(distinct c).
> code:
> 
>   DataFrame 
> df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
>   df1.registerTempTable("hd_salesflat");
>   DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
>   DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM 
> hd_salesflat GROUP BY areacode1");
>   cube5.show(50);
>   cube6.show(50);
> 
> my  data:
> transno | quantity | unitprice | areacode1
> 76317828|  1.  |  25.  |  HDCN
> data schema:
>  |-- areacode1: string (nullable = true)
>  |-- quantity: decimal(20,4) (nullable = true)
>  |-- unitprice: decimal(20,4) (nullable = true)
>  |-- transno: string (nullable = true)
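
As an editorial aside, here is a Scala sketch equivalent to the Java snippet
above, convenient for checking the two aggregates side by side on Spark 1.6.x.
The HDFS path, table name, and column names come from the report; the context
setup is an assumption.

{code}
// Editorial sketch: compare SUM(quantity*unitprice) with and without
// COUNT(DISTINCT transno) over the same data.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SumWithDistinctCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SPARK-19102 check"))
    val sqlContext = new SQLContext(sc)

    val df1 = sqlContext.read.parquet("hdfs://cdh01:8020/sandboxdata_A/test/a")
    df1.registerTempTable("hd_salesflat")

    // Without COUNT(DISTINCT ...): reported to return the expected sumprice.
    val cube5 = sqlContext.sql(
      "SELECT areacode1, SUM(quantity*unitprice) AS sumprice " +
        "FROM hd_salesflat GROUP BY areacode1")

    // With COUNT(DISTINCT ...): reported to return a sumprice inflated by a
    // constant factor, although it should equal cube5's.
    val cube6 = sqlContext.sql(
      "SELECT areacode1, SUM(quantity*unitprice) AS sumprice, " +
        "COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1")

    cube5.show(50)
    cube6.show(50)
    sc.stop()
  }
}
{code}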



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-19102) Accuracy error of spark SQL results

2017-01-06 Thread XiaodongCui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaodongCui reopened SPARK-19102:
-

The data under the path hdfs://cdh01:8020/sandboxdata_A/test/a is in the 
attached file.

> Accuracy error of spark SQL results
> ---
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6
>Reporter: XiaodongCui
> Attachments: a.zip
>
>
> The problem is that cube6's second column (sumprice) is 10000 times bigger 
> than cube5's second column (sumprice), but they should be equal. The bug only 
> reappears with queries of the form sum(a * b), count(distinct c).
>   DataFrame 
> df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
>   df1.registerTempTable("hd_salesflat");
>   DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
>   DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM 
> hd_salesflat GROUP BY areacode1");
>   cube5.show(50);
>   cube6.show(50);
> my  data:
> transno | quantity | unitprice | areacode1
> 76317828|  1.  |  25.  |  HDCN
> data schema:
>  |-- areacode1: string (nullable = true)
>  |-- quantity: decimal(20,4) (nullable = true)
>  |-- unitprice: decimal(20,4) (nullable = true)
>  |-- transno: string (nullable = true)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19101) Spark Beeline catch a exeception when run command " load data inpath '/data/test/test.csv' overwrite into table db.test partition(area='021')"

2017-01-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804113#comment-15804113
 ] 

Sean Owen commented on SPARK-19101:
---

This is more of a Hive error, and suggests something else went wrong earlier 
("filesystem closed"). By itself I don't think this is actionable.

> Spark Beeline catch a exeception when run command " load data inpath 
> '/data/test/test.csv'  overwrite into table db.test partition(area='021')"
> ---
>
> Key: SPARK-19101
> URL: https://issues.apache.org/jira/browse/SPARK-19101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: spark2.0.1
>Reporter: Xiaochen Ouyang
>
> Firstly, two commands as follows:
> 1:load data  inpath '/data/test/lte_cm_projdata_52.csv' overwrite into 
> table  db.lte_cm_projdata partition(p_provincecode=52);
> 2:load data local  inpath '/home/mr/lte_cm_projdata_52.csv' overwrite 
> into table  db.lte_cm_projdata partition(p_provincecode=52);
> The first command failed, but the second command succeeded.
> Beeline exception:
> 0: jdbc:hive2://10.43.156.221:18000> load data  inpath 
> '/data/test/lte_cm_projdata_52.csv' overwrite into table  
> db.lte_cm_projdata partition(p_provincecode=52);
> Error: java.lang.reflect.InvocationTargetException (state=,code=0)
> ThriftServer2 logs :
> 2017-01-06 15:16:16,518 INFO HiveMetaStore: 58: get_partition_with_auth : 
> db=zxvmax tbl=lte_cm_projdata[52]
> 2017-01-06 15:16:16,518 INFO audit: ugi=rootip=unknown-ip-addr  
> cmd=get_partition_with_auth : db=zxvmax tbl=lte_cm_projdata[52]
> 2017-01-06 15:16:16,611 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING, 
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.sql.hive.client.Shim_v0_14.loadPartition(HiveShim.scala:622)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply$mcV$sp(HiveClientImpl.scala:635)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply(HiveClientImpl.scala:635)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply(HiveClientImpl.scala:635)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:634)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply$mcV$sp(HiveExternalCatalog.scala:279)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply(HiveExternalCatalog.scala:271)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply(HiveExternalCatalog.scala:271)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:271)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:317)
> at 
> org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:325)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> 

[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results

2017-01-06 Thread XiaodongCui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaodongCui updated SPARK-19102:

Description: 
The problem is that cube6's second column (sumprice) is 10000 times bigger than 
cube5's second column (sumprice), but they should be equal. The bug only 
reappears with queries of the form sum(a * b), count(distinct c).

DataFrame 
df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
df1.registerTempTable("hd_salesflat");
DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM hd_salesflat 
GROUP BY areacode1");
cube5.show(50);
cube6.show(50);

my  data:
transno | quantity | unitprice | areacode1
76317828|  1.0000  |  25.0000  |  HDCN

data schema:
 |-- areacode1: string (nullable = true)
 |-- quantity: decimal(20,4) (nullable = true)
 |-- unitprice: decimal(20,4) (nullable = true)
 |-- transno: string (nullable = true)

  was:
The problem is that the results of the code below differ in the second column: 
the second SQL result is 10000 times bigger than the first SQL result. The bug 
only reappears with queries of the form sum(a * b), count(distinct c).

DataFrame 
df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
df1.registerTempTable("hd_salesflat");
DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM hd_salesflat 
GROUP BY areacode1");
cube5.show(50);
cube6.show(50);

my  data:
transno | quantity | unitprice | areacode1
76317828|  1.0000  |  25.0000  |  HDCN

data schema:
 |-- areacode1: string (nullable = true)
 |-- quantity: decimal(20,4) (nullable = true)
 |-- unitprice: decimal(20,4) (nullable = true)
 |-- transno: string (nullable = true)


> Accuracy error of spark SQL results
> ---
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6
>Reporter: XiaodongCui
> Attachments: a.zip
>
>
> The problem is that cube6's second column (sumprice) is 10000 times bigger 
> than cube5's second column (sumprice), but they should be equal. The bug only 
> reappears with queries of the form sum(a * b), count(distinct c).
>   DataFrame 
> df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
>   df1.registerTempTable("hd_salesflat");
>   DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
>   DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM 
> hd_salesflat GROUP BY areacode1");
>   cube5.show(50);
>   cube6.show(50);
> my  data:
> transno | quantity | unitprice | areacode1
> 76317828|  1.  |  25.  |  HDCN
> data schema:
>  |-- areacode1: string (nullable = true)
>  |-- quantity: decimal(20,4) (nullable = true)
>  |-- unitprice: decimal(20,4) (nullable = true)
>  |-- transno: string (nullable = true)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results

2017-01-06 Thread XiaodongCui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiaodongCui updated SPARK-19102:

Attachment: a.zip

The attached file is my data; the data is in Parquet format.

> Accuracy error of spark SQL results
> ---
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6
>Reporter: XiaodongCui
> Attachments: a.zip
>
>
> The problem is that the results of the code below differ in the second 
> column: the second SQL result is 10000 times bigger than the first SQL 
> result. The bug only reappears with queries of the form sum(a * b), 
> count(distinct c).
>   DataFrame 
> df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
>   df1.registerTempTable("hd_salesflat");
>   DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
>   DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM 
> hd_salesflat GROUP BY areacode1");
>   cube5.show(50);
>   cube6.show(50);
> my  data:
> transno | quantity | unitprice | areacode1
> 76317828|  1.  |  25.  |  HDCN
> data schema:
>  |-- areacode1: string (nullable = true)
>  |-- quantity: decimal(20,4) (nullable = true)
>  |-- unitprice: decimal(20,4) (nullable = true)
>  |-- transno: string (nullable = true)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19102) Accuracy error of spark SQL results

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19102.
---
Resolution: Invalid

This doesn't describe a problem clearly. There's no data, no specifics about 
'accuracy' and some unclear reference to something being 10000 times bigger.

> Accuracy error of spark SQL results
> ---
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6
>Reporter: XiaodongCui
>
> The problem is that the results of the code below differ in the second 
> column: the second SQL result is 10000 times bigger than the first SQL 
> result. The bug only reappears with queries of the form sum(a * b), 
> count(distinct c).
>   DataFrame 
> df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
>   df1.registerTempTable("hd_salesflat");
>   DataFrame cube5 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
>   DataFrame cube6 = sqlContext.sql("SELECT areacode1, 
> SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno)  FROM 
> hd_salesflat GROUP BY areacode1");
>   cube5.show(50);
>   cube6.show(50);
> my  data:
> transno | quantity | unitprice | areacode1
> 76317828|  1.  |  25.  |  HDCN
> data schema:
>  |-- areacode1: string (nullable = true)
>  |-- quantity: decimal(20,4) (nullable = true)
>  |-- unitprice: decimal(20,4) (nullable = true)
>  |-- transno: string (nullable = true)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19095) virtualenv example does not work in yarn cluster mode

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19095.
---
Resolution: Duplicate

If this isn't something supported yet, it's not a bug, and I'd resolve this as 
a duplicate of the parent.

> virtualenv example does not work in yarn cluster mode
> -
>
> Key: SPARK-19095
> URL: https://issues.apache.org/jira/browse/SPARK-19095
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Yesha Vora
>Priority: Critical
>
> Spark version: 2
> Steps:
> * install virtualenv on all nodes
> * create requirement1.txt with "numpy > requirement1.txt "
> * Run kmeans.py application in yarn-cluster mode. 
> {code}
> spark-submit --master yarn --deploy-mode cluster --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.type=native" --conf 
> "spark.pyspark.virtualenv.requirements=/tmp/requirements1.txt" --conf 
> "spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv" --jars 
> /usr/hdp/current/hadoop-client/lib/hadoop-lzo.jar kmeans.py 
> /tmp/in/kmeans_data.txt 3{code}
> The application fails to find numpy.
> {code}
> LogType:stdout
> Log Upload Time:Thu Jan 05 20:05:49 + 2017
> LogLength:134
> Log Contents:
> Traceback (most recent call last):
>   File "kmeans.py", line 27, in 
> import numpy as np
> ImportError: No module named numpy
> End of LogType:stdout
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19096) Kmeans.py application fails with virtualenv and due to parse error

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19096.
---
Resolution: Duplicate

> Kmeans.py application fails with virtualenv and due to  parse error 
> 
>
> Key: SPARK-19096
> URL: https://issues.apache.org/jira/browse/SPARK-19096
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Yesha Vora
>
> Spark version : 2
> Steps:
> * Install virtualenv ( pip install virtualenv)
> * create requirements.txt (pip freeze > /tmp/requirements.txt)
> * start kmeans.py application in yarn-client mode.
> The application fails with Runtime Exception
> {code:title=app log}
> 17/01/05 19:49:59 INFO deprecation: mapred.task.partition is deprecated. 
> Instead, use mapreduce.task.partition
> 17/01/05 19:49:59 INFO deprecation: mapred.job.id is deprecated. Instead, use 
> mapreduce.job.id
> Invalid requirement: 'pip freeze'
> Traceback (most recent call last):
>   File 
> "/grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0006/container_1483592608863_0006_01_02/virtualenv_application_1483592608863_0006_0/lib/python2.7/site-packages/pip/req/req_install.py",
>  line 82, in __init__
> req = Requirement(req)
>   File 
> "/grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0006/container_1483592608863_0006_01_02/virtualenv_application_1483592608863_0006_0/lib/python2.7/site-packages/pip/_vendor/packaging/requirements.py",
>  line 96, in __init__
> requirement_string[e.loc:e.loc + 8]))
> InvalidRequirement: Invalid requirement, parse error at "u'freeze'"
> 17/01/05 19:50:03 WARN BlockManager: Putting block rdd_3_0 failed due to an 
> exception
> 17/01/05 19:50:03 WARN BlockManager: Block rdd_3_0 could not be removed as it 
> was not found on disk or in memory
> 17/01/05 19:50:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> {code}
> {code:title=job client log}
> 17/01/05 19:50:07 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 2, 
> xxx.site, executor 1): java.lang.RuntimeException: Fail to run command: 
> virtualenv_application_1483592608863_0006_1/bin/python -m pip --cache-dir 
> /home/yarn install -r requirements.txt
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:128)
>   at 
> org.apache.spark.api.python.PythonWorkerFactory.<init>(PythonWorkerFactory.scala:70)
>   at 
> org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117)
>   at 
> org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117)
>   at 
> scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
>   at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at 

[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19098:
--
Priority: Minor  (was: Critical)

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
>Reporter: Steven Ruppert
>Priority: Minor
> Attachments: doubling-season.png
>
>
> I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla 
> ConnectedComponents use, notably one that works fine with identical code on 
> spark 2.0.1, but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have 
> access to the original logs, so this initial report will be a little vague. 
> However, this behavior as described might ring a bell to somebody.
> Roughly: 
> {noformat}
> val edges: RDD[Edge[Int]] = _ // from file
> val vertices: RDD[(VertexId, Int)] = _ // from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10)
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a 
> strange doubling of shuffle traffic in each round of Pregel (inside 
> ConnectedComponents), increasing from the actual data size of ~50 GB, to 
> 100GB, to 200GB, all the way to around 40TB before I killed the job. The data 
> being shuffled was apparently an RDD of ShippableVertexPartition .
> Oddly enough, only the kryo-serialized shuffled data doubled in size. The 
> heap usage of the executors themselves remained stable, or at least did not 
> account 1 to 1 for the 40TB of shuffled data, for I definitely do not have 
> 40TB of RAM. Furthermore, I also have kryo reference tracking turned on 
> still, so whatever is leaking somehow gets around that.
> I'll update this ticket once I have more details, unless somebody else with 
> the same problem reports back first.
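
As an editorial aside, here is a minimal self-contained sketch of the usage
pattern described above, on a tiny synthetic graph instead of the reporter's
multi-billion-edge input. The kryo and checkpoint settings are included only
because the report's environment mentions them; the checkpoint path is an
assumption.

{code}
// Editorial sketch of the ConnectedComponents usage pattern in the report.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.ConnectedComponents
import org.apache.spark.rdd.RDD

object CcSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SPARK-19098 sketch")
      .setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/cc-checkpoints") // HDFS in the report's setup

    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(10L, 11L, 1)))
    val vertices: RDD[(VertexId, Int)] =
      sc.parallelize(Seq(1L, 2L, 3L, 10L, 11L).map(id => (id, 0)))
    val graph = Graph(vertices, edges)

    // Same call as in the report: run Pregel for at most 10 iterations.
    val components = ConnectedComponents.run(graph, 10).vertices
    components.collect().foreach(println)

    sc.stop()
  }
}
{code}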



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19103) In web ui,URL's host name should be a specific IP address.

2017-01-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19103.
---
Resolution: Invalid

I can't understand what this means, but it's not true that the UI should use 
only IP addresses.

> In web ui,URL's host name should be a specific IP address.
> --
>
> Key: SPARK-19103
> URL: https://issues.apache.org/jira/browse/SPARK-19103
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
> Environment: spark 2.0.2
>Reporter: guoxiaolong
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> In the web UI, the URL's host name should be a specific IP address, because 
> opening the URL requires resolving the host name. If the host name cannot be 
> resolved, the URL cannot be reached. Please see the attachment. Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


