[jira] [Commented] (SPARK-19103) In web ui,URL's host name should be a specific IP address.
[ https://issues.apache.org/jira/browse/SPARK-19103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807033#comment-15807033 ] guoxiaolong commented on SPARK-19103: - Because opening this address in the browser returns a 404, we must configure the domain-name-to-IP mapping in the hosts file. Please see the attachment 3.png. > In web ui,URL's host name should be a specific IP address. > -- > > Key: SPARK-19103 > URL: https://issues.apache.org/jira/browse/SPARK-19103 > Project: Spark > Issue Type: Bug > Components: Web UI > Environment: spark 2.0.2 >Reporter: guoxiaolong >Priority: Minor > Attachments: 1.png, 2.png > > > In the web UI, the URL's host name should be a specific IP address, because opening the URL requires resolving the host name. > When the host name cannot be resolved, the URL cannot be reached. > Please see the attachments. Thank you! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
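For reference, the workaround described in the comment amounts to adding a mapping like the following to the client machine's hosts file (the host name and address below are placeholders, not values from the issue):

```
# /etc/hosts (on Windows: C:\Windows\System32\drivers\etc\hosts)
# Map the Spark master's host name to its IP so web UI links resolve.
192.0.2.10   spark-master.example.com
```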
[jira] [Created] (SPARK-19115) SparkSQL unsupported the command " create external table if not exist new_tbl like old_tbl"
Xiaochen Ouyang created SPARK-19115: --- Summary: SparkSQL unsupported the command " create external table if not exist new_tbl like old_tbl" Key: SPARK-19115 URL: https://issues.apache.org/jira/browse/SPARK-19115 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Environment: spark2.0.1 hive1.2.1 Reporter: Xiaochen Ouyang spark 2.0.1 does not support the command "create external table if not exist new_tbl like old_tbl". We tried to modify the SqlBase.g4 file, changing "| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike" to "| CREATE EXTERNAL? TABLE (IF NOT EXISTS)? target=tableIdentifier LIKE source=tableIdentifier #createTableLike", then compiled Spark and replaced the jar "spark-catalyst-2.0.1.jar". After that, we could run the command "create external table if not exist new_tbl like old_tbl" successfully; unfortunately, the generated table's type in the metastore database is MANAGED_TABLE rather than EXTERNAL_TABLE.
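A minimal sketch (hypothetical Python, not Spark's actual code) of why the grammar change alone was not enough: matching the EXTERNAL keyword in the parser does nothing unless the visitor that builds the table descriptor also propagates the flag; otherwise the default MANAGED type wins.

```python
MANAGED_TABLE = "MANAGED_TABLE"
EXTERNAL_TABLE = "EXTERNAL_TABLE"

def create_table_like(source_table, is_external):
    """Mirrors the reported behavior: the parser accepted EXTERNAL,
    but the flag is dropped when the table descriptor is built."""
    return {"like": source_table, "type": MANAGED_TABLE}

def create_table_like_fixed(source_table, is_external):
    """The flag must be threaded through to the catalog entry."""
    table_type = EXTERNAL_TABLE if is_external else MANAGED_TABLE
    return {"like": source_table, "type": table_type}

# The buggy version yields MANAGED_TABLE even when EXTERNAL was requested:
print(create_table_like("old_tbl", is_external=True)["type"])        # MANAGED_TABLE
print(create_table_like_fixed("old_tbl", is_external=True)["type"])  # EXTERNAL_TABLE
```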
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806897#comment-15806897 ] jin xing commented on SPARK-18113: -- [~xq2005], [~aash] I am seeing this issue in my cluster sometimes. When *OutputCommitCoordinatorEndpoint* receives *AskPermissionToCommitOutput* for the first time, it marks the task attempt as a committer in *authorizedCommittersByStage* and sends back the response. But if the worker fails to get the response within *spark.rpc.askTimeout*, it retries sending *AskPermissionToCommitOutput*. The retry is denied by *OutputCommitCoordinatorEndpoint*, because a committer has already been registered for the partition, even though the registered committer and the retrying worker are the same task attempt. Reproducing is easy:
{code:title=OutputCommitCoordinator.scala|borderStyle=solid}
..
// Marked private[scheduler] instead of private so this can be mocked in tests
private[scheduler] def handleAskPermissionToCommit(
    stage: StageId,
    partition: PartitionId,
    attemptNumber: TaskAttemptNumber): Boolean = synchronized {
  authorizedCommittersByStage.get(stage) match {
    case Some(authorizedCommitters) =>
      authorizedCommitters(partition) match {
        case NO_AUTHORIZED_COMMITTER =>
          logDebug(s"Authorizing attemptNumber=$attemptNumber to commit for stage=$stage, " +
            s"partition=$partition")
          authorizedCommitters(partition) = attemptNumber
          Thread.sleep(150 * 1000)  // delay the response past spark.rpc.askTimeout
          true
        case existingCommitter =>
          logDebug(s"Denying attemptNumber=$attemptNumber to commit for stage=$stage, " +
            s"partition=$partition; existingCommitter = $existingCommitter")
          false
      }
    case None =>
      logDebug(s"Stage $stage has completed, so not allowing attempt number $attemptNumber of " +
        s"partition $partition to commit")
      false
  }
}
..
{code}
When the worker asks to be registered as a committer for the first time, sleep 150 seconds, which is longer than *spark.rpc.askTimeout=120 seconds*. 
When the worker retries *AskPermissionToCommitOutput*, it gets a *CommitDeniedException*, and the task fails with reason *TaskCommitDenied*, which is not counted as a task failure (SPARK-11178), so the TaskScheduler reschedules this task indefinitely. [~xq2005] If you don't have time, could I make a pr for this? > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing > > The executor's *AskPermissionToCommitOutput* to the driver failed, so it retried the > send. The driver receives two AskPermissionToCommitOutput messages and > handles them both, but the executor ignores the first response (true) and receives the > second response (false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever, and the driver enters an infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... > 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... 
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.rpc.askTimeout > at >
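The deadloop above can be modeled without Spark at all. The sketch below (plain Python; names and the `allow_same_attempt_retry` switch are illustrative, not Spark's API) reproduces the coordinator's decision logic and the fix under discussion: treating an RPC retry from the already-registered attempt as authorized instead of denied.

```python
NO_AUTHORIZED_COMMITTER = None

def make_coordinator(allow_same_attempt_retry=False):
    """Toy model of OutputCommitCoordinator.handleAskPermissionToCommit."""
    authorized = {}  # (stage, partition) -> attemptNumber

    def handle_ask_permission(stage, partition, attempt):
        key = (stage, partition)
        existing = authorized.get(key, NO_AUTHORIZED_COMMITTER)
        if existing is NO_AUTHORIZED_COMMITTER:
            authorized[key] = attempt  # register the first asker as committer
            return True
        if allow_same_attempt_retry and existing == attempt:
            return True  # fixed behavior: a retry from the committer itself is fine
        return False  # current behavior: every later ask is denied

    return handle_ask_permission

# Current behavior: the first ask succeeds, but its response is lost to the
# RPC timeout, and the retry is denied -> CommitDeniedException -> the task
# is rescheduled forever because TaskCommitDenied is not a counted failure.
ask = make_coordinator()
print(ask(2, 24, 0))  # True (but the response never reaches the executor)
print(ask(2, 24, 0))  # False (the retry is denied)
```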
[jira] [Closed] (SPARK-18929) Add Tweedie distribution in GLM
[ https://issues.apache.org/jira/browse/SPARK-18929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang closed SPARK-18929. --- Resolution: Unresolved > Add Tweedie distribution in GLM > --- > > Key: SPARK-18929 > URL: https://issues.apache.org/jira/browse/SPARK-18929 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 72h > Remaining Estimate: 72h > > I propose to add the full Tweedie family to the GeneralizedLinearRegression > model. The Tweedie family is characterized by a power variance function. > Currently supported distributions such as the Gaussian, Poisson and Gamma > families are special cases of the > [Tweedie|https://en.wikipedia.org/wiki/Tweedie_distribution]. > I propose to add support for the other distributions: > * compound Poisson: 1 < variancePower < 2. This one is widely used to model > zero-inflated continuous distributions. > * positive stable: variancePower > 2 and variancePower != 3. Used to model > extreme values. > * inverse Gaussian: variancePower = 3. > The Tweedie family is supported in most statistical packages such as R > (statmod), SAS, h2o, etc.
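The family is unified by its power variance function, V(mu) = mu ** p, where p is the variancePower above. A quick sketch (plain Python, illustrative only, with the dispersion factor omitted) of how the named families fall out of p:

```python
def tweedie_variance(mu: float, variance_power: float) -> float:
    """Tweedie variance function V(mu) = mu ** p (dispersion factored out)."""
    return mu ** variance_power

# p = 0: constant variance (Gaussian); p = 1: V = mu (Poisson);
# p = 2: V = mu^2 (Gamma); p = 3: V = mu^3 (inverse Gaussian);
# 1 < p < 2: compound Poisson; p > 2 and p != 3: positive stable.
for p in (0, 1, 2, 3):
    print(p, tweedie_variance(2.0, p))
```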
[jira] [Commented] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)
[ https://issues.apache.org/jira/browse/SPARK-19108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806774#comment-15806774 ] Shivaram Venkataraman commented on SPARK-19108: --- +1 - This is a good idea. One thing I'd like to add is that it might be better to create one broadcast rather than two for the sake of efficiency. For each broadcast variable we contact the driver to get location information and then initiate some fetches - thus, to keep the number of messages low, having one broadcast variable is better. > Broadcast all shared parts of tasks (to reduce task serialization time) > --- > > Key: SPARK-19108 > URL: https://issues.apache.org/jira/browse/SPARK-19108 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Kay Ousterhout > > Expand the amount of information that's broadcasted for tasks, to avoid > serializing data per-task that should only be sent to each executor once for > the entire stage. > Conceptually, this means we'd have new classes specially for sending the > minimal necessary data to the executor, like: > {code}
> /**
>  * Metadata about the taskset needed by the executor for all tasks in this
>  * taskset. Subset of the full data kept on the driver to make it faster to
>  * serialize and send to executors.
>  */
> class ExecutorTaskSetMeta(
>     val stageId: Int,
>     val stageAttemptId: Int,
>     val properties: Properties,
>     val addedFiles: Map[String, String],
>     val addedJars: Map[String, String]
>     // maybe task metrics here?
> )
>
> class ExecutorTaskData(
>     val partitionId: Int,
>     val attemptNumber: Int,
>     val taskId: Long,
>     val taskBinary: Broadcast[Array[Byte]],
>     val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
> )
> {code} > Then all the info you'd need to send to the executors would be a serialized > version of ExecutorTaskData. 
Furthermore, given the simplicity of that > class, you could serialize manually, and then for each task you could just > modify the first two ints & one long directly in the byte buffer. (You could > do the same trick for serialization even if ExecutorTaskSetMeta was not a > broadcast, but that will keep the msgs small as well.) > There are a bunch of details I'm skipping here: you'd also need to do some > special handling for the TaskMetrics; the way tasks get started in the > executor would change; you'd also need to refactor {{Task}} to let it get > reconstructed from this information (or add more to ExecutorTaskSetMeta); and > probably other details I'm overlooking now. > (this is copied from SPARK-18890 and [~imranr]'s comment there; cc > [~shivaram])
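The byte-buffer trick described above can be sketched in a few lines (plain Python with fixed-width fields; the layout is illustrative, not Spark's wire format): serialize the shared payload once, then for each task overwrite only the leading two ints and one long in place.

```python
import struct

# Per-task header: partitionId (int), attemptNumber (int), taskId (long).
HEADER = struct.Struct(">iiq")  # 4 + 4 + 8 = 16 bytes

def make_template(shared_payload: bytes) -> bytearray:
    """Build the message once: a blank header followed by the shared bytes."""
    return bytearray(HEADER.size) + shared_payload

def patch_task_header(buf: bytearray, partition_id: int,
                      attempt_number: int, task_id: int) -> bytearray:
    """Rewrite only the 16 header bytes; the shared tail is untouched."""
    HEADER.pack_into(buf, 0, partition_id, attempt_number, task_id)
    return buf

# The shared part is serialized once per stage; each task patches its header.
template = make_template(b"<serialized broadcast refs, shared per stage>")
msg = patch_task_header(template, partition_id=24, attempt_number=0, task_id=110)
```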
[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19110: -- Shepherd: Joseph K. Bradley Target Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Miao Wang > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-19110: -- Affects Version/s: 2.2.0 1.3.1 1.4.1 1.5.2 1.6.3 2.0.2 2.1.0 > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Miao Wang > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18194) Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit
[ https://issues.apache.org/jira/browse/SPARK-18194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-18194. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16480 [https://github.com/apache/spark/pull/16480] > Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit > -- > > Key: SPARK-18194 > URL: https://issues.apache.org/jira/browse/SPARK-18194 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: zhengruifeng >Assignee: Sue Ann Hong > Fix For: 2.2.0 > > > Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858
[ https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806575#comment-15806575 ] Apache Spark commented on SPARK-16920: -- User 'mhmoudr' has created a pull request for this issue: https://github.com/apache/spark/pull/16495 > Investigate and fix issues introduced in SPARK-15858 > > > Key: SPARK-16920 > URL: https://issues.apache.org/jira/browse/SPARK-16920 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg > > There were several issues regarding the PR resolving SPARK-15858, my comments > are available here: > https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93 > The two most important issues are: > 1. The PR did not add a stress test proving it resolved the issue it was > supposed to (though I have no doubt the optimization made is indeed correct). > 2. The PR introduced quadratic prediction time in terms of the number of > trees, which was previously linear. This issue needs to be investigated for > whether it causes problems for large numbers of trees (say, 1000), an > appropriate test should be added, and then fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858
[ https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16920: Assignee: (was: Apache Spark) > Investigate and fix issues introduced in SPARK-15858 > > > Key: SPARK-16920 > URL: https://issues.apache.org/jira/browse/SPARK-16920 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg > > There were several issues regarding the PR resolving SPARK-15858, my comments > are available here: > https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93 > The two most important issues are: > 1. The PR did not add a stress test proving it resolved the issue it was > supposed to (though I have no doubt the optimization made is indeed correct). > 2. The PR introduced quadratic prediction time in terms of the number of > trees, which was previously linear. This issue needs to be investigated for > whether it causes problems for large numbers of trees (say, 1000), an > appropriate test should be added, and then fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16920) Investigate and fix issues introduced in SPARK-15858
[ https://issues.apache.org/jira/browse/SPARK-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16920: Assignee: Apache Spark > Investigate and fix issues introduced in SPARK-15858 > > > Key: SPARK-16920 > URL: https://issues.apache.org/jira/browse/SPARK-16920 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Apache Spark > > There were several issues regarding the PR resolving SPARK-15858, my comments > are available here: > https://github.com/apache/spark/commit/393db655c3c43155305fbba1b2f8c48a95f18d93 > The two most important issues are: > 1. The PR did not add a stress test proving it resolved the issue it was > supposed to (though I have no doubt the optimization made is indeed correct). > 2. The PR introduced quadratic prediction time in terms of the number of > trees, which was previously linear. This issue needs to be investigated for > whether it causes problems for large numbers of trees (say, 1000), an > appropriate test should be added, and then fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19114) Backpressure rate is cast from double to long to double
Tony Novak created SPARK-19114: -- Summary: Backpressure rate is cast from double to long to double Key: SPARK-19114 URL: https://issues.apache.org/jira/browse/SPARK-19114 Project: Spark Issue Type: Bug Reporter: Tony Novak We have a Spark streaming job where each record takes well over a second to execute, so the stable rate is under 1 element/second. We set spark.streaming.backpressure.enabled=true and spark.streaming.backpressure.pid.minRate=0.1, but backpressure did not appear to be effective, even though the TRACE level logs from PIDRateEstimator showed that the new rate was 0.1. As it turns out, even though the minRate parameter is a Double, and the rate estimate generated by PIDRateEstimator is a Double as well, RateController casts the new rate to a Long. As a result, if the computed rate is less than 1, it's truncated to 0, which ends up being interpreted as "no limit". What's particularly confusing is that the Guava RateLimiter class takes a rate limit as a double, so the long value ends up being cast back to a double. Is there any reason not to keep the rate limit as a double all the way through? I'm happy to create a pull request if this makes sense. We encountered the bug on Spark 1.6.2, but it looks like the code in the master branch is still affected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
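The truncation described above is easy to demonstrate without Spark (a plain Python model of the reported Double -> Long -> Double chain; the function name and None-as-"no limit" convention are illustrative):

```python
def effective_rate(new_rate: float, keep_double: bool = False):
    """Model the reported cast chain in the rate controller.

    With the Long cast, any computed rate below 1 element/second truncates
    to 0, which is then interpreted as "no limit"; keeping the rate as a
    double preserves fractional rates like backpressure.pid.minRate = 0.1.
    """
    rate = new_rate if keep_double else float(int(new_rate))  # the Long cast
    return None if rate <= 0 else rate  # None stands for "no limit"

print(effective_rate(0.1))                    # None: backpressure silently off
print(effective_rate(0.1, keep_double=True))  # 0.1: the intended limit
```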
[jira] [Comment Edited] (SPARK-9215) Implement WAL-free Kinesis receiver that gives at-least once guarantee
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804425#comment-15804425 ] Gaurav Shah edited comment on SPARK-9215 at 1/7/17 2:04 AM: [~tdas] I know this is an old pull request but was still wondering if you can help. I was wondering whether we can enhance this to make sure that we checkpoint only after blocks of data have been written, so we need not implement Spark checkpointing in the first place. Each block has a start and end sequence number. was (Author: gaurav24): [~tdas] I know this is an old pull request but was still wondering if you can help. I was wondering can we enhance this to make sure that we checkpoint only after blocks of data has been written. So we need to implement Spark checkpoint in the first place. Each block has a start and end seq number. > Implement WAL-free Kinesis receiver that gives at-least once guarantee > - > > Key: SPARK-9215 > URL: https://issues.apache.org/jira/browse/SPARK-9215 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.5.0 > > > Currently, the KinesisReceiver can lose some data in the case of certain > failures (receiver and driver failures). Using write ahead logs can > mitigate some of the problem, but it is not ideal because WALs don't work with > S3 (eventual consistency, etc.), which is the most likely file system to be > used in the EC2 environment. Hence, we have to take a different approach to > improving reliability for Kinesis. > Detailed design doc - > https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing
[jira] [Updated] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-18372: Assignee: mingjie tang > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang >Assignee: mingjie tang > Fix For: 1.6.4 > > Attachments: _thumb_37664.png > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > The issue is that these folders don't get cleaned up and accumulate. > = > With the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - it didn't make any > difference. > .hive-staging folders are created under the folder > hive/warehouse/hivesampletablecopypy/ > We have tried adding this property to hive-site.xml and restarting the > components: > hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging > A new .hive-staging folder was still created in the hive/warehouse/ folder. > Moreover, please note that if we run the same Hive query in pure Hive via the > Hive CLI on the same Spark cluster, we don't see this behavior, > so it doesn't appear to be a Hive issue/behavior in this case - this is Spark > behavior. > I checked in Ambari; spark.yarn.preserve.staging.files=false is already set in the Spark > configuration. > The issue happens via spark-submit as well - the customer used the following > command to reproduce it: > spark-submit test-hive-staging-cleanup.py
[jira] [Resolved] (SPARK-18372) .Hive-staging folders created from Spark hiveContext are not getting cleaned up
[ https://issues.apache.org/jira/browse/SPARK-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18372. - Resolution: Resolved Fix Version/s: 1.6.4 > .Hive-staging folders created from Spark hiveContext are not getting cleaned > up > --- > > Key: SPARK-18372 > URL: https://issues.apache.org/jira/browse/SPARK-18372 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.2, 1.6.3 > Environment: spark standalone and spark yarn >Reporter: mingjie tang > Fix For: 1.6.4 > > Attachments: _thumb_37664.png > > > Steps to reproduce: > > 1. Launch spark-shell > 2. Run the following scala code via Spark-Shell > scala> val hivesampletabledf = sqlContext.table("hivesampletable") > scala> import org.apache.spark.sql.DataFrameWriter > scala> val dfw : DataFrameWriter = hivesampletabledf.write > scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS hivesampletablecopypy ( > clientid string, querytime string, market string, deviceplatform string, > devicemake string, devicemodel string, state string, country string, > querydwelltime double, sessionid bigint, sessionpagevieworder bigint )") > scala> dfw.insertInto("hivesampletablecopypy") > scala> val hivesampletablecopypydfdf = sqlContext.sql("""SELECT clientid, > querytime, deviceplatform, querydwelltime FROM hivesampletablecopypy WHERE > state = 'Washington' AND devicemake = 'Microsoft' AND querydwelltime > 15 """) > hivesampletablecopypydfdf.show > 3. 
in HDFS (in our case, WASB), we can see the following folders > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693666-1/-ext-1 > > hive/warehouse/hivesampletablecopypy/.hive-staging_hive_2016-10-14_00-52-44_666_967373710066693 > The issue is that these folders don't get cleaned up and accumulate. > = > With the customer, we have tried setting "SET > hive.exec.stagingdir=/tmp/hive;" in hive-site.xml - it didn't make any > difference. > .hive-staging folders are created under the folder > hive/warehouse/hivesampletablecopypy/ > We have tried adding this property to hive-site.xml and restarting the > components: > hive.exec.stagingdir = ${hive.exec.scratchdir}/${user.name}/.staging > A new .hive-staging folder was still created in the hive/warehouse/ folder. > Moreover, please note that if we run the same Hive query in pure Hive via the > Hive CLI on the same Spark cluster, we don't see this behavior, > so it doesn't appear to be a Hive issue/behavior in this case - this is Spark > behavior. > I checked in Ambari; spark.yarn.preserve.staging.files=false is already set in the Spark > configuration. > The issue happens via spark-submit as well - the customer used the following > command to reproduce it: > spark-submit test-hive-staging-cleanup.py
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806368#comment-15806368 ] Ilya Matiach commented on SPARK-17975: -- I was able to reproduce the issue based on your dataset, and I've made the suggested fix in the pull request. I added a test case that had a similar issue to your dataset and could reproduce the error. Thank you! > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000-record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about its type in a way that causes > the RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. 
> {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA
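For readers following the report above, the "lying about its type" failure mode can be illustrated with a plain-Scala sketch (no Spark involved; `Edge` and `LyingCollection` here are hypothetical stand-ins for graphx's Edge and EdgeRDD): because of type erasure, an unchecked cast at the collection level succeeds silently, and the ClassCastException only surfaces when an element is actually dereferenced — which matches where the stack trace fails, inside `Iterator.foreach`.

```scala
// Hypothetical stand-ins, not Spark classes: the idea is that EdgeRDD's
// partitions hold tuples internally while the RDD is typed as RDD[Edge].
case class Edge(src: Long, dst: Long)

class LyingCollection(raw: Seq[Any]) {
  // Unchecked cast: after erasure, Seq[Any] "becomes" Seq[Edge] with no check.
  def elements: Seq[Edge] = raw.asInstanceOf[Seq[Edge]]
}

object Demo extends App {
  val c = new LyingCollection(Seq((1L, 2L), (3L, 4L))) // tuples, not Edges
  val es = c.elements                  // no error yet: cast is deferred
  es.foreach(e => println(e.src))      // ClassCastException: Tuple2 cannot be cast to Edge
}
```

This is only a sketch of the mechanism the reporter describes, not the actual EdgeRDD code path.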
[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17975: Assignee: (was: Apache Spark) > EMLDAOptimizer fails with ClassCastException on YARN > (full issue description and stack trace are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17975: Assignee: Apache Spark > EMLDAOptimizer fails with ClassCastException on YARN > (full issue description and stack trace are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806365#comment-15806365 ] Apache Spark commented on SPARK-17975: -- User 'imatiach-msft' has created a pull request for this issue: https://github.com/apache/spark/pull/16494 > EMLDAOptimizer fails with ClassCastException on YARN > (full issue description and stack trace are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19093) Cached tables are not used in SubqueryExpression
[ https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19093: Assignee: (was: Apache Spark) > Cached tables are not used in SubqueryExpression > > > Key: SPARK-19093 > URL: https://issues.apache.org/jira/browse/SPARK-19093 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Josh Rosen > > See reproduction at > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1903098128019500/2699761537338853/1395282846718893/latest.html > Consider the following: > {code} > Seq(("a", "b"), ("c", "d")) > .toDS > .write > .parquet("/tmp/rows") > val df = spark.read.parquet("/tmp/rows") > df.cache() > df.count() > df.createOrReplaceTempView("rows") > spark.sql(""" > select * from rows cross join rows > """).explain(true) > spark.sql(""" > select * from rows where not exists (select * from rows) > """).explain(true) > {code} > In both plans, I'd expect that both sides of the joins would read from the > cached table for both the cross join and anti join, but the left anti join > produces the following plan, which only reads the left side from cache and > reads the right side via a regular non-cached scan: > {code} > == Parsed Logical Plan == > 'Project [*] > +- 'Filter NOT exists#3994 >: +- 'Project [*] >: +- 'UnresolvedRelation `rows` >+- 'UnresolvedRelation `rows` > == Analyzed Logical Plan == > _1: string, _2: string > Project [_1#3775, _2#3776] > +- Filter NOT predicate-subquery#3994 [] >: +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] >: +- Project [_1#3775, _2#3776] >:+- SubqueryAlias rows >: +- Relation[_1#3775,_2#3776] parquet >+- SubqueryAlias rows > +- Relation[_1#3775,_2#3776] parquet > == Optimized Logical Plan == > Join LeftAnti > :- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > : +- *FileScan parquet [_1#3775,_2#3776] 
Batched: true, Format: Parquet, > Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_1:string,_2:string> > +- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] >+- Relation[_1#3775,_2#3776] parquet > == Physical Plan == > BroadcastNestedLoopJoin BuildRight, LeftAnti > :- InMemoryTableScan [_1#3775, _2#3776] > : +- InMemoryRelation [_1#3775, _2#3776], true, 1, StorageLevel(disk, > memory, deserialized, 1 replicas) > : +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: > Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_1:string,_2:string> > +- BroadcastExchange IdentityBroadcastMode >+- *Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002] > +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], > PushedFilters: [], ReadSchema: struct<_1:string,_2:string> > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
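As a quick way to confirm the behavior described above, the number of cache-backed scans can be counted directly from the physical plan. This is a sketch against the Spark 2.x API (in these versions `InMemoryTableScanExec` lives in `org.apache.spark.sql.execution.columnar`); it assumes the temp view `rows` from the reproduction has already been set up:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Count the physical-plan nodes that read from the cached relation.
def cachedScans(df: DataFrame): Int =
  df.queryExecution.executedPlan.collect {
    case scan: InMemoryTableScanExec => scan
  }.size

// If the cache were used on both sides, both queries would report 2;
// per the plan in the report, the anti join only reaches 1.
cachedScans(spark.sql("select * from rows cross join rows"))
cachedScans(spark.sql("select * from rows where not exists (select * from rows)"))
```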
[jira] [Commented] (SPARK-19093) Cached tables are not used in SubqueryExpression
[ https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806285#comment-15806285 ] Apache Spark commented on SPARK-19093: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/16493 > Cached tables are not used in SubqueryExpression > (full issue description and plans are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19093) Cached tables are not used in SubqueryExpression
[ https://issues.apache.org/jira/browse/SPARK-19093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19093: Assignee: Apache Spark > Cached tables are not used in SubqueryExpression > (full issue description and plans are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806218#comment-15806218 ] Randall Whitman commented on SPARK-7768: Would you mind commenting on how the updated UDT works with a class of an unmodified third-party library? Thanks in advance. > Make user-defined type (UDT) API public > --- > > Key: SPARK-7768 > URL: https://issues.apache.org/jira/browse/SPARK-7768 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Xiangrui Meng >Priority: Critical > > As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it > would be nice to make the UDT API public in 1.5. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
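For context on the question above, this is a hedged sketch of what a UDT wrapper for an unmodifiable third-party class might look like under the still-internal `UserDefinedType` interface. `ThirdPartyPoint` is a hypothetical stand-in for a library class, method signatures have shifted between Spark 1.x and 2.x (e.g. the serialized array representation), and how such a wrapper gets registered when the class cannot carry the `@SQLUserDefinedType` annotation is exactly what this ticket is about:

```scala
import org.apache.spark.sql.types._

// ThirdPartyPoint stands in for a library class we cannot annotate with
// @SQLUserDefinedType; the Catalyst mapping lives entirely in the wrapper.
class ThirdPartyPoint(val x: Double, val y: Double)

class PointUDT extends UserDefinedType[ThirdPartyPoint] {
  // Catalyst-side representation: a fixed-length array of doubles.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  // Exact serialized form varies by Spark version; shown here conceptually.
  override def serialize(p: ThirdPartyPoint): Any = Array(p.x, p.y)
  override def deserialize(datum: Any): ThirdPartyPoint = datum match {
    case a: Array[Double] => new ThirdPartyPoint(a(0), a(1))
  }
  override def userClass: Class[ThirdPartyPoint] = classOf[ThirdPartyPoint]
}
```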
[jira] [Assigned] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7768: --- Assignee: Apache Spark > Make user-defined type (UDT) API public > (issue details are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7768: --- Assignee: (was: Apache Spark) > Make user-defined type (UDT) API public > (issue details are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public
[ https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806167#comment-15806167 ] Apache Spark commented on SPARK-7768: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/16478 > Make user-defined type (UDT) API public > (issue details are quoted in the first message above) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19111) S3 Mesos history upload fails if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Allen updated SPARK-19111: -- Summary: S3 Mesos history upload fails if too large (was: S3 Mesos history upload fails if too large or if distributed datastore is misbehaving) > S3 Mesos history upload fails if too large > -- > > Key: SPARK-19111 > URL: https://issues.apache.org/jira/browse/SPARK-19111 > Project: Spark > Issue Type: Bug > Components: EC2, Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > {code} > 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped > Spark web UI at http://REDACTED:4041 > 2017-01-06T21:32:32,938 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.jvmGCTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBlocksFetched > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSerializationTime > 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! 
Dropping event SparkListenerExecutorMetricsUpdate( > 364,WrappedArray()) > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSize > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.peakExecutionMemory > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.fetchWaitTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.memoryBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.diskBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.recordsRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorDeserializeTime > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorRunTime > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBlocksFetched > 2017-01-06T21:32:32,943 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' > closed. Now beginning upload > 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray()) > 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray()) > 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray()) > {code} > Running spark on mesos, some large jobs fail to upload to the history server > storage! > A successful sequence of events in the log that yield an upload are as > follows: > {code} > 2017-01-06T19:14:32,925 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp' > 2017-01-06T21:59:14,789 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > closed. Now beginning upload > 2017-01-06T21:59:44,679 INFO [main] >
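Not from this thread, but worth noting given the behavior in these logs: the s3n `NativeS3FileSystem` buffers the whole event log to a local temp file and uploads it in a single PUT when the stream closes, which is fragile for large logs. A commonly suggested mitigation (an assumption on my part, not a confirmed fix for this ticket) is to write event logs through the s3a connector, which supports multipart uploads, or to HDFS. A sketch for spark-defaults.conf, with a hypothetical bucket name:

```
spark.eventLog.enabled  true
# s3a uploads large objects in parts instead of one PUT on close (s3n behavior above)
spark.eventLog.dir      s3a://some-bucket/eventLogs
```

The s3a connector also needs the matching hadoop-aws jar and credentials configured; whether this avoids the specific failure here would need to be verified.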
[jira] [Assigned] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user
[ https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19113: Assignee: Apache Spark (was: Shixiong Zhu) > Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source > should be sent to the user > - > > Key: SPARK-19113 > URL: https://issues.apache.org/jira/browse/SPARK-19113 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user
[ https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15806085#comment-15806085 ] Apache Spark commented on SPARK-19113: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16492 > Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source > should be sent to the user > - > > Key: SPARK-19113 > URL: https://issues.apache.org/jira/browse/SPARK-19113 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user
[ https://issues.apache.org/jira/browse/SPARK-19113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19113: Assignee: Shixiong Zhu (was: Apache Spark) > Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source > should be sent to the user > - > > Key: SPARK-19113 > URL: https://issues.apache.org/jira/browse/SPARK-19113 > Project: Spark > Issue Type: Test > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19111) S3 Mesos history upload fails silently if too large
[ https://issues.apache.org/jira/browse/SPARK-19111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Allen updated SPARK-19111: -- Summary: S3 Mesos history upload fails silently if too large (was: S3 Mesos history upload fails if too large) > S3 Mesos history upload fails silently if too large > --- > > Key: SPARK-19111 > URL: https://issues.apache.org/jira/browse/SPARK-19111 > Project: Spark > Issue Type: Bug > Components: EC2, Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > {code} > 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped > Spark web UI at http://REDACTED:4041 > 2017-01-06T21:32:32,938 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.jvmGCTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBlocksFetched > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSerializationTime > 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! 
Dropping event SparkListenerExecutorMetricsUpdate( > 364,WrappedArray()) > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.resultSize > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.peakExecutionMemory > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.fetchWaitTime > 2017-01-06T21:32:32,939 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.memoryBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.diskBytesSpilled > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.localBytesRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.recordsRead > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorDeserializeTime > 2017-01-06T21:32:32,940 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.executorRunTime > 2017-01-06T21:32:32,941 INFO [SparkListenerBus] > com.metamx.starfire.spark.SparkDriver - emitting metric: > internal.metrics.shuffle.read.remoteBlocksFetched > 2017-01-06T21:32:32,943 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 
'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' > closed. Now beginning upload > 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray()) > 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray()) > 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] > org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already > stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray()) > {code} > Running spark on mesos, some large jobs fail to upload to the history server > storage! > A successful sequence of events in the log that yield an upload are as > follows: > {code} > 2017-01-06T19:14:32,925 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp' > 2017-01-06T21:59:14,789 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key > 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' > closed. Now beginning upload > 2017-01-06T21:59:44,679 INFO [main] > org.apache.hadoop.fs.s3native.NativeS3FileSystem
[jira] [Created] (SPARK-19113) Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user
Shixiong Zhu created SPARK-19113: Summary: Fix flaky test: o.a.s.sql.streaming.StreamSuite fatal errors from a source should be sent to the user Key: SPARK-19113 URL: https://issues.apache.org/jira/browse/SPARK-19113 Project: Spark Issue Type: Test Components: Structured Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19112) add codec for ZStandard
Thomas Graves created SPARK-19112: - Summary: add codec for ZStandard Key: SPARK-19112 URL: https://issues.apache.org/jira/browse/SPARK-19112 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Thomas Graves ZStandard (https://github.com/facebook/zstd and http://facebook.github.io/zstd/) has been in use for a while now, and v1.0 was recently released. Hadoop (https://issues.apache.org/jira/browse/HADOOP-13578) and others (https://issues.apache.org/jira/browse/KAFKA-4514) are adopting it. Zstd seems to give great results: roughly Gzip-level compression at Lz4-level CPU cost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
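For context on what "add codec" would entail: a Spark compression codec is a thin wrapper around a pair of stream classes. Below is a minimal, hypothetical sketch of the shape a Zstd codec could take, assuming the zstd-jni bindings (`com.github.luben.zstd`); the class name and config key here are illustrative, not the eventual implementation.

```scala
import java.io.{InputStream, OutputStream}

import com.github.luben.zstd.{ZstdInputStream, ZstdOutputStream}
import org.apache.spark.SparkConf
import org.apache.spark.io.CompressionCodec

// Hypothetical sketch only: a Zstandard codec in the same shape as Spark's
// existing codecs (LZ4CompressionCodec, SnappyCompressionCodec). The config
// key below is made up for illustration.
class ZStdCompressionCodec(conf: SparkConf) extends CompressionCodec {

  // A low level keeps CPU cost near Lz4 while still improving on Gzip ratios.
  private val level = conf.getInt("spark.io.compression.zstd.level", 1)

  override def compressedOutputStream(s: OutputStream): OutputStream =
    new ZstdOutputStream(s, level)

  override def compressedInputStream(s: InputStream): InputStream =
    new ZstdInputStream(s)
}
```

The codec would then be selectable the same way as the existing ones, e.g. via `spark.io.compression.codec`.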
[jira] [Created] (SPARK-19111) S3 Mesos history upload fails if too large or if distributed datastore is misbehaving
Charles Allen created SPARK-19111: - Summary: S3 Mesos history upload fails if too large or if distributed datastore is misbehaving Key: SPARK-19111 URL: https://issues.apache.org/jira/browse/SPARK-19111 Project: Spark Issue Type: Bug Components: EC2, Mesos, Spark Core Affects Versions: 2.0.0 Reporter: Charles Allen {code} 2017-01-06T21:32:32,928 INFO [main] org.apache.spark.ui.SparkUI - Stopped Spark web UI at http://REDACTED:4041 2017-01-06T21:32:32,938 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.jvmGCTime 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.localBlocksFetched 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.resultSerializationTime 2017-01-06T21:32:32,939 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! 
Dropping event SparkListenerExecutorMetricsUpdate( 364,WrappedArray()) 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.resultSize 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.peakExecutionMemory 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.fetchWaitTime 2017-01-06T21:32:32,939 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.memoryBytesSpilled 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.remoteBytesRead 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.diskBytesSpilled 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.localBytesRead 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.recordsRead 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.executorDeserializeTime 2017-01-06T21:32:32,940 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: output/bytes 2017-01-06T21:32:32,941 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.executorRunTime 2017-01-06T21:32:32,941 INFO [SparkListenerBus] com.metamx.starfire.spark.SparkDriver - emitting metric: internal.metrics.shuffle.read.remoteBlocksFetched 2017-01-06T21:32:32,943 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1387.inprogress' closed. 
Now beginning upload 2017-01-06T21:32:32,963 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(905,WrappedArray()) 2017-01-06T21:32:32,973 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(519,WrappedArray()) 2017-01-06T21:32:32,988 ERROR [heartbeat-receiver-event-loop-thread] org.apache.spark.scheduler.LiveListenerBus - SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(596,WrappedArray()) {code} Running spark on mesos, some large jobs fail to upload to the history server storage! A successful sequence of events in the log that yield an upload are as follows: {code} 2017-01-06T19:14:32,925 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' writing to tempfile '/mnt/tmp/hadoop/output-2516573909248961808.tmp' 2017-01-06T21:59:14,789 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' closed. Now beginning upload 2017-01-06T21:59:44,679 INFO [main] org.apache.hadoop.fs.s3native.NativeS3FileSystem - OutputStream for key 'eventLogs/remnant/46bf8f87-6de6-4da8-9cba-5b2fecd0875e-1434.inprogress' upload complete {code} But large jobs do not ever get to the {{upload complete}} log message, and instead exit before completion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
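Since the failure mode is silent (the log is closed and "Now beginning upload" is printed, but "upload complete" never appears), one way to surface affected applications is to scan the event-log directory for files stuck in the in-progress state. This diagnostic sketch uses the standard Hadoop FileSystem API; the bucket and prefix are placeholders, not values from the report.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Diagnostic sketch (not from the issue): list event logs that never left the
// ".inprogress" state, i.e. the ones for which "upload complete" was never
// logged before the driver exited.
object FindStuckEventLogs {
  def main(args: Array[String]): Unit = {
    val base = "s3n://my-bucket/eventLogs/"  // placeholder prefix
    val fs = FileSystem.get(new URI(base), new Configuration())
    fs.listStatus(new Path(base))
      .map(_.getPath)
      .filter(_.getName.endsWith(".inprogress"))
      .foreach(p => println(s"possibly failed upload: $p"))
  }
}
```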
[jira] [Closed] (SPARK-18710) Add offset to GeneralizedLinearRegression models
[ https://issues.apache.org/jira/browse/SPARK-18710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wayne Zhang closed SPARK-18710. --- Resolution: Unresolved > Add offset to GeneralizedLinearRegression models > > > Key: SPARK-18710 > URL: https://issues.apache.org/jira/browse/SPARK-18710 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.2 >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Labels: features > Original Estimate: 10h > Remaining Estimate: 10h > > The current GeneralizedLinearRegression model does not support an offset. An > offset can be useful for taking exposure into account, or for testing the > incremental effect of new variables. It is possible to use weights in the current > environment to achieve the same effect as specifying an offset for certain > models (e.g., Poisson and Binomial with a log offset), but it is desirable to have the > offset option work with more general cases, e.g., a negative offset or an > offset that is hard to specify using weights (e.g., an offset to the probability > rather than the odds in logistic regression). > Effort would involve: > * update the regression class to support offsetCol > * update IWLS to take the offset into account > * add test cases for offset > I can start working on this if the community approves this feature. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
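For readers unfamiliar with the feature being requested above: in a GLM, an offset is a term added to the linear predictor with its coefficient fixed at 1. For Poisson regression with a log link and exposure E_i, the standard relationship is:

```latex
% Offset enters the linear predictor with a fixed coefficient of 1:
\eta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + o_i
% Poisson regression with log link and exposure E_i uses o_i = \log E_i, so
\log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \log E_i
\quad\Longleftrightarrow\quad
\mu_i = E_i \, e^{\mathbf{x}_i^{\top}\boldsymbol{\beta}}
```

This is why the Poisson case can be emulated with weights (model the rate y_i / E_i with weight E_i), while a negative offset or an offset applied on the probability scale cannot.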
[jira] [Assigned] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19110: Assignee: Apache Spark > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Miao Wang >Assignee: Apache Spark > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805688#comment-15805688 ] Apache Spark commented on SPARK-19110: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/16491 > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Miao Wang > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
[ https://issues.apache.org/jira/browse/SPARK-19110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19110: Assignee: (was: Apache Spark) > DistributedLDAModel returns different logPrior for original and loaded model > > > Key: SPARK-19110 > URL: https://issues.apache.org/jira/browse/SPARK-19110 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Miao Wang > > While adding DistributedLDAModel training summary for SparkR, I found that > the logPrior for original and loaded model is different. > For example, in the test("read/write DistributedLDAModel"), I add the test: > val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior > val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior > assert(logPrior === logPrior2) > The test fails: > -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19110) DistributedLDAModel returns different logPrior for original and loaded model
Miao Wang created SPARK-19110: - Summary: DistributedLDAModel returns different logPrior for original and loaded model Key: SPARK-19110 URL: https://issues.apache.org/jira/browse/SPARK-19110 Project: Spark Issue Type: Bug Components: ML, MLlib Reporter: Miao Wang While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior values for the original and loaded models differ. For example, in the test("read/write DistributedLDAModel"), I added the test:
{code}
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
{code}
The test fails with: -4.394180878889078 did not equal -4.294290536919573 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19109) ORC metadata section can sometimes exceed protobuf message size limit
[ https://issues.apache.org/jira/browse/SPARK-19109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nic Eggert updated SPARK-19109: --- Description: Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue: If there are too many small stripes and with many columns, the overhead for storing metadata (column stats) can exceed the default protobuf message size of 64MB. Reading such files will throw the following exception {code} Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811) at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1331) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1281) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4887) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4803) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12925) at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12872) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13599) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13546) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630) at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.(ReaderImpl.java:468) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:314) at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228) at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) {code} This is fixed in Hive 1.3, so it should be fairly straightforward to pick up the patch. As a side note: Spark's management of its Hive fork/dependency seems incredibly arcane to me. 
Surely there's a better way than publishing to central from developers' personal repos. was: Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue: If there are too many small stripes and with many columns, the overhead for storing metadata (column stats) can exceed the default protobuf message size of 64MB. Reading such files will throw the following exception {code} Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at
[jira] [Created] (SPARK-19109) ORC metadata section can sometimes exceed protobuf message size limit
Nic Eggert created SPARK-19109: -- Summary: ORC metadata section can sometimes exceed protobuf message size limit Key: SPARK-19109 URL: https://issues.apache.org/jira/browse/SPARK-19109 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0 Reporter: Nic Eggert Basically, Spark inherits HIVE-11592 from its Hive dependency. From that issue: If there are too many small stripes and with many columns, the overhead for storing metadata (column stats) can exceed the default protobuf message size of 64MB. Reading such files will throw the following exception {code} Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException: Protocol message was too large. May be malicious. Use CodedInputStream.setSizeLimit() to increase the size limit. at com.google.protobuf.InvalidProtocolBufferException.sizeLimitExceeded(InvalidProtocolBufferException.java:110) at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:755) at com.google.protobuf.CodedInputStream.readRawBytes(CodedInputStream.java:811) at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:329) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1331) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.(OrcProto.java:1281) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1369) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4887) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics.(OrcProto.java:4803) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4990) at org.apache.hadoop.hive.ql.io.orc.OrcProto$ColumnStatistics$1.parsePartialFrom(OrcProto.java:4985) at 
com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12925) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics.(OrcProto.java:12872) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12961) at org.apache.hadoop.hive.ql.io.orc.OrcProto$StripeStatistics$1.parsePartialFrom(OrcProto.java:12956) at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13599) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.(OrcProto.java:13546) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13635) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata$1.parsePartialFrom(OrcProto.java:13630) at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:217) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:223) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) at org.apache.hadoop.hive.ql.io.orc.OrcProto$Metadata.parseFrom(OrcProto.java:13746) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl$MetaInfoObjExtractor.(ReaderImpl.java:468) at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.(ReaderImpl.java:314) at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:228) at org.apache.hadoop.hive.ql.io.orc.FileDump.main(FileDump.java:67) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) {code} This is fixed in Hive 1.3, so it should be 
fairly straightforward to pick up the patch. As a side note: Spark's management of its Hive fork/dependency seems incredibly arcane to me. Surely there's a better way than publishing to central from developers' personal repos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
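For reference, the HIVE-11592 fix works by parsing the ORC metadata through a protobuf `CodedInputStream` whose size limit has been raised above the 64MB default, as the exception message itself suggests. A hedged sketch of that pattern follows; the surrounding ORC reader wiring is omitted, and the helper name is made up for illustration.

```scala
import java.io.InputStream

import com.google.protobuf.CodedInputStream

// Sketch of the workaround pattern from HIVE-11592: protobuf refuses messages
// larger than 64MB by default, so the reader raises the limit before parsing
// the (potentially huge) metadata section. `streamWithRaisedLimit` is an
// illustrative name, not a real Hive/Spark method.
object OrcMetadataReader {
  def streamWithRaisedLimit(in: InputStream): CodedInputStream = {
    val cis = CodedInputStream.newInstance(in)
    cis.setSizeLimit(Int.MaxValue)  // default limit is 64 << 20 bytes
    cis                             // hand this to OrcProto.Metadata.parseFrom
  }
}
```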
[jira] [Resolved] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs
[ https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6099. -- Resolution: Done > Stabilize mllib ClassificationModel, RegressionModel APIs > - > > Key: SPARK-6099 > URL: https://issues.apache.org/jira/browse/SPARK-6099 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > The abstractions spark.mllib.classification.ClassificationModel and > spark.mllib.regression.RegressionModel have been Experimental for a while. > This is a problem since some of the implementing classes are not Experimental > (e.g., LogisticRegressionModel). > We should finalize the API and make it non-Experimental ASAP. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6098) Propagate Experimental tag to child classes
[ https://issues.apache.org/jira/browse/SPARK-6098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6098. -- Resolution: Not A Problem This is defunct now that these aren't even experimental > Propagate Experimental tag to child classes > --- > > Key: SPARK-6098 > URL: https://issues.apache.org/jira/browse/SPARK-6098 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Issue: An abstraction (e.g., mllib.classification.ClassificationModel) may be > Experimental even when its implementing classes (e.g., > mllib.classification.LogisticRegressionModel) are not. > Proposal: That tag should be propagated to child classes (or better yet to > the relevant parts of the child classes). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-18890: --- Issue Type: Improvement (was: Bug) > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). 
> I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805495#comment-15805495 ] Kay Ousterhout commented on SPARK-18890: I just opened SPARK-19108 for the broadcast issue. In the meantime, after thinking about this more (and also based on your comments on the associated PRs Imran) I think we should go ahead and merge this change to consolidate the serialization in one place. If nothing else, that change makes the code more readable, and I suspect will make it easier to implement further optimizations to the serialization in the future. > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). 
Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). > I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
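The parallelization benefit mentioned in the issue (serializing different tasks on different threads) can be sketched as follows. This is a hedged illustration only: {{FakeTask}}, the pool size, and Java serialization are stand-ins, not Spark's actual {{Task}} class or serializer.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for a Spark Task: just something serializable.
case class FakeTask(partitionId: Int, payload: Array[Byte]) extends Serializable

def serialize(task: FakeTask): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(task)
  oos.close()
  bos.toByteArray
}

// Instead of serializing every task on the single scheduler thread,
// fan the work out over a small fixed pool.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

val tasks = (0 until 100).map(i => FakeTask(i, Array.fill(16)(i.toByte)))
val serialized =
  Await.result(Future.traverse(tasks)(t => Future(serialize(t))), 10.seconds)
assert(serialized.length == 100)
```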
[jira] [Created] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)
Kay Ousterhout created SPARK-19108: --- Summary: Broadcast all shared parts of tasks (to reduce task serialization time) Key: SPARK-19108 URL: https://issues.apache.org/jira/browse/SPARK-19108 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Kay Ousterhout

Expand the amount of information that's broadcast for tasks, to avoid serializing data per-task that should only be sent to each executor once for the entire stage. Conceptually, this means we'd have new classes specifically for sending the minimal necessary data to the executor, like:

{code}
/**
 * Metadata about the taskset needed by the executor for all tasks in this taskset. Subset of the
 * full data kept on the driver, to make it faster to serialize and send to executors.
 */
class ExecutorTaskSetMeta(
  val stageId: Int,
  val stageAttemptId: Int,
  val properties: Properties,
  val addedFiles: Map[String, String],
  val addedJars: Map[String, String]
  // maybe task metrics here?
)

class ExecutorTaskData(
  val partitionId: Int,
  val attemptNumber: Int,
  val taskId: Long,
  val taskBinary: Broadcast[Array[Byte]],
  val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
)
{code}

Then all the info you'd need to send to the executors would be a serialized version of ExecutorTaskData. Furthermore, given the simplicity of that class, you could serialize it manually, and then for each task just modify the first two ints & one long directly in the byte buffer. (You could do the same trick for serialization even if ExecutorTaskSetMeta were not a broadcast, but the broadcast keeps the messages small as well.)

There are a bunch of details I'm skipping here: you'd also need some special handling for the TaskMetrics; the way tasks get started in the executor would change; you'd also need to refactor {{Task}} to let it be reconstructed from this information (or add more to ExecutorTaskSetMeta); and probably other details I'm overlooking now.
(this is copied from SPARK-18890 and [~imranr]'s comment there; cc [~shivaram]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
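The "serialize manually, then modify the first two ints & one long directly in the byte buffer" trick described above can be sketched like this. The field order, offsets, and body contents are illustrative assumptions, not Spark's actual wire format:

```scala
import java.nio.ByteBuffer

// Serialize the shared part once; keep a template with a fixed-offset header
// (two ints + one long) in front of it, and patch only the header per task.
val sharedBody = Array.fill[Byte](64)(0x2a)   // stand-in for the shared serialized data
val template = ByteBuffer.allocate(4 + 4 + 8 + sharedBody.length)
template.putInt(0)       // partitionId placeholder
template.putInt(0)       // attemptNumber placeholder
template.putLong(0L)     // taskId placeholder
template.put(sharedBody)

def messageFor(partitionId: Int, attemptNumber: Int, taskId: Long): Array[Byte] = {
  val out = template.array().clone()   // one copy per task; body bytes reused as-is
  val buf = ByteBuffer.wrap(out)
  buf.putInt(0, partitionId)           // absolute puts touch only the header
  buf.putInt(4, attemptNumber)
  buf.putLong(8, taskId)
  out
}

val msg = messageFor(partitionId = 7, attemptNumber = 1, taskId = 42L)
assert(ByteBuffer.wrap(msg).getInt(0) == 7)
assert(ByteBuffer.wrap(msg).getLong(8) == 42L)
```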
[jira] [Resolved] (SPARK-19074) Update Structured Streaming Programming guide for Update Mode
[ https://issues.apache.org/jira/browse/SPARK-19074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-19074. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16468 [https://github.com/apache/spark/pull/16468] > Update Structured Streaming Programming guide for Update Mode > - > > Key: SPARK-19074 > URL: https://issues.apache.org/jira/browse/SPARK-19074 > Project: Spark > Issue Type: Improvement > Components: Documentation, Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout closed SPARK-3937. - > Unsafe memory access inside of Snappy library > - > > Key: SPARK-3937 > URL: https://issues.apache.org/jira/browse/SPARK-3937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: Patrick Wendell > > This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't > have much information about this other than the stack trace. However, it was > concerning enough I figured I should post it. > {code} > java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code > org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) > > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) > > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) > > java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > scala.collection.Iterator$class.foreach(Iterator.scala:727) > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > scala.collection.AbstractIterator.to(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > org.apache.spark.scheduler.Task.run(Task.scala:56) > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands,
[jira] [Resolved] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-3937. --- Resolution: Won't Fix Closing this due to lack of activity / reports of issues on recent versions of Spark > Unsafe memory access inside of Snappy library > - > > Key: SPARK-3937 > URL: https://issues.apache.org/jira/browse/SPARK-3937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: Patrick Wendell > > This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't > have much information about this other than the stack trace. However, it was > concerning enough I figured I should post it. > {code} > java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code > org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) > > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) > > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) > > java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > scala.collection.Iterator$class.foreach(Iterator.scala:727) > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > scala.collection.AbstractIterator.to(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > 
org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805366#comment-15805366 ] Seth Hendrickson commented on SPARK-10078: --

As part of [SPARK-17136|https://issues.apache.org/jira/browse/SPARK-17136], I am looking into a design for a generic optimizer interface for Spark ML. Ideally this should be abstracted such that, as Yanbo mentioned, users can switch between optimizers easily. I don't think adding this to Breeze is important, since we hope to add our own interface directly to Spark.

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Yanbo Liang
>
> This is to implement a scalable version of vector-free L-BFGS
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).
> Design document:
> https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Commented] (SPARK-6099) Stabilize mllib ClassificationModel, RegressionModel APIs
[ https://issues.apache.org/jira/browse/SPARK-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805232#comment-15805232 ] Ilya Matiach commented on SPARK-6099: -

It doesn't look like the APIs are experimental anymore; can this JIRA be closed?

> Stabilize mllib ClassificationModel, RegressionModel APIs
> -
>
> Key: SPARK-6099
> URL: https://issues.apache.org/jira/browse/SPARK-6099
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> The abstractions spark.mllib.classification.ClassificationModel and
> spark.mllib.regression.RegressionModel have been Experimental for a while.
> This is a problem since some of the implementing classes are not Experimental
> (e.g., LogisticRegressionModel).
> We should finalize the API and make it non-Experimental ASAP.
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805205#comment-15805205 ] Marcelo Vanzin commented on SPARK-5493: --- > Since keytab will be owned by a "service" account, not by proxied users, and > keytab file will have proper OS permissions, not sure I'm following how > keytab would be exposed to those proxied users? When you use the principal / keytab options in spark-submit, Spark uploads the keytab to HDFS, under the user running the application (in this case, the proxy user). > Support proxy users under kerberos > -- > > Key: SPARK-5493 > URL: https://issues.apache.org/jira/browse/SPARK-5493 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Brock Noland >Assignee: Marcelo Vanzin > Fix For: 1.3.0 > > > When using kerberos, services may want to use spark-submit to submit jobs as > a separate user. For example a service like hive might want to submit jobs as > a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805198#comment-15805198 ] Ruslan Dautkhanov edited comment on SPARK-5493 at 1/6/17 6:19 PM: -- {quote}There might be ways to hack support for that without changes in Spark, but I'd like to see a proper API in Spark for distributing new delegation tokens. I mentioned that in SPARK-14743, but although that bug is closed, that particular feature hasn't been implemented yet. {quote} [~vanzin], would it be possible to submit a new jira for this part that didn't get implemented in SPARK-14743? Thank you. {quote} There might be ways to hack support for that without changes in Spark {quote} Something like ssh'ing regularly into all Hadoop nodes under proxied user id and running kinit? Yep, a proper API would be better here. {quote} It would expose the keytab to the proxied user, which in 99% of the cases is not wanted. {quote} Since keytab will be owned by a "service" account, not by proxied users, and keytab file will have proper OS permissions, not sure I'm following how keytab would be exposed to those proxied users? Could you please elaborate. Proxy authentication is only for Hadoop services. keytab is just a file and we could rely on OS permissions to lock its access. I'm probably missing something here. was (Author: tagar): {quote}There might be ways to hack support for that without changes in Spark, but I'd like to see a proper API in Spark for distributing new delegation tokens. I mentioned that in SPARK-14743, but although that bug is closed, that particular feature hasn't been implemented yet. {quote} [~vanzin], would it be possible to submit a new jira for this part that didn't get implemented in SPARK-14743? Thank you. {quote} There might be ways to hack support for that without changes in Spark {quote} Something like ssh'ing regularly into all Hadoop nodes under proxied user id and running kinit? Yep, a proper API would be better here. 
{quote} It would expose the keytab to the proxied user, which in 99% of the cases is not wanted. {quote} Since keytab will be owned by a "service" account, not by proxied users, and keytab file will have proper OS permissions, not sure I'm following how keytab would be exposed to those proxied users? Could you please elaborate. Proxy authentication is only for Hadoop services. keytab is just a file and we could rely on OS permissions to lock its access, relying on regular OS permissions. I'm probably missing something here. > Support proxy users under kerberos > -- > > Key: SPARK-5493 > URL: https://issues.apache.org/jira/browse/SPARK-5493 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Brock Noland >Assignee: Marcelo Vanzin > Fix For: 1.3.0 > > > When using kerberos, services may want to use spark-submit to submit jobs as > a separate user. For example a service like hive might want to submit jobs as > a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5493) Support proxy users under kerberos
[ https://issues.apache.org/jira/browse/SPARK-5493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805198#comment-15805198 ] Ruslan Dautkhanov commented on SPARK-5493: -- {quote}There might be ways to hack support for that without changes in Spark, but I'd like to see a proper API in Spark for distributing new delegation tokens. I mentioned that in SPARK-14743, but although that bug is closed, that particular feature hasn't been implemented yet. {quote} [~vanzin], would it be possible to submit a new jira for this part that didn't get implemented in SPARK-14743? Thank you. {quote} There might be ways to hack support for that without changes in Spark {quote} Something like ssh'ing regularly into all Hadoop nodes under proxied user id and running kinit? Yep, a proper API would be better here. {quote} It would expose the keytab to the proxied user, which in 99% of the cases is not wanted. {quote} Since keytab will be owned by a "service" account, not by proxied users, and keytab file will have proper OS permissions, not sure I'm following how keytab would be exposed to those proxied users? Could you please elaborate. Proxy authentication is only for Hadoop services. keytab is just a file and we could rely on OS permissions to lock its access, relying on regular OS permissions. I'm probably missing something here. > Support proxy users under kerberos > -- > > Key: SPARK-5493 > URL: https://issues.apache.org/jira/browse/SPARK-5493 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Brock Noland >Assignee: Marcelo Vanzin > Fix For: 1.3.0 > > > When using kerberos, services may want to use spark-submit to submit jobs as > a separate user. For example a service like hive might want to submit jobs as > a client user. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
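For context, the two submission modes under discussion can be sketched as below. The class, jar, principal, and keytab path are illustrative placeholders; the flags themselves are real spark-submit options.

```shell
# Proxy-user mode (what SPARK-5493 added): the service account's own
# Kerberos ticket is used, and the job runs as the proxied user.
spark-submit --proxy-user alice --class com.example.App app.jar

# Principal/keytab mode: spark-submit ships the keytab so the application
# can renew its own tickets. Per the comment above, the keytab is uploaded
# to HDFS under the user running the application, which is why combining
# this with --proxy-user would expose the service keytab to the proxied user.
spark-submit --principal svc@EXAMPLE.COM --keytab /etc/security/svc.keytab \
  --class com.example.App app.jar
```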
[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC
[ https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805169#comment-15805169 ] Ilya Matiach commented on SPARK-11968: -- Can someone with permissions change the status from In Progress to Open - as the pull request sent was closed and the issue still exists. > ALS recommend all methods spend most of time in GC > -- > > Key: SPARK-11968 > URL: https://issues.apache.org/jira/browse/SPARK-11968 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.2, 1.6.0 >Reporter: Joseph K. Bradley > > After adding recommendUsersForProducts and recommendProductsForUsers to ALS > in spark-perf, I noticed that it takes much longer than ALS itself. Looking > at the monitoring page, I can see it is spending about 8min doing GC for each > 10min task. That sounds fixable. Looking at the implementation, there is > clearly an opportunity to avoid extra allocations: > [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283] > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19083) sbin/start-history-server.sh scripts use of $@ without ""
[ https://issues.apache.org/jira/browse/SPARK-19083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19083. Resolution: Fixed Assignee: zuotingbing Fix Version/s: 2.2.0, 2.1.1

> sbin/start-history-server.sh script's use of $@ without ""
> -
>
> Key: SPARK-19083
> URL: https://issues.apache.org/jira/browse/SPARK-19083
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.1.0
> Environment: linux
> Reporter: zuotingbing
> Assignee: zuotingbing
> Priority: Trivial
> Fix For: 2.1.1, 2.2.0
>
> The sbin/start-history-server.sh script uses $@ without quotes; if any
> argument contains whitespace, this changes the number of args received by
> HistoryServerArguments::parse(args: List[String]).
> It should instead be written as:
> exec ... org.apache.spark.deploy.history.HistoryServer 1 "$@"
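The quoting bug above applies to any shell script that forwards its arguments. A minimal self-contained sketch (function and argument names are illustrative):

```shell
#!/bin/sh
# Reports how many arguments it received.
count_args() {
  echo "$#"
}

forward_unquoted() {
  # BUG: unquoted $@ re-splits arguments on whitespace
  count_args $@
}

forward_quoted() {
  # FIX: "$@" preserves each original argument as one word
  count_args "$@"
}

forward_unquoted "--properties-file" "/tmp/my conf.properties"   # prints 3
forward_quoted   "--properties-file" "/tmp/my conf.properties"   # prints 2
```

With the unquoted form, the path containing a space arrives as two separate arguments, which is exactly how the argument count seen by HistoryServerArguments::parse could change.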
[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805087#comment-15805087 ] Imran Rashid commented on SPARK-18890: --

[~gq] I think you misunderstood my suggestion about using a broadcast for the task. I'm not suggesting using a broadcast to contain *all* the task information, only the information which is shared across all tasks in a taskset. E.g., the preferred location is ignored on the executor, so we wouldn't even bother serializing it. Conceptually, this means we'd have new classes specifically for sending the minimal necessary data to the executor, like:

{code}
/**
 * Metadata about the taskset needed by the executor for all tasks in this taskset. Subset of the
 * full data kept on the driver, to make it faster to serialize and send to executors.
 */
class ExecutorTaskSetMeta(
  val stageId: Int,
  val stageAttemptId: Int,
  val properties: Properties,
  val addedFiles: Map[String, String],
  val addedJars: Map[String, String]
  // maybe task metrics here?
)

class ExecutorTaskData(
  val partitionId: Int,
  val attemptNumber: Int,
  val taskId: Long,
  val taskBinary: Broadcast[Array[Byte]],
  val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
)
{code}

Then all the info you'd need to send to the executors would be a serialized version of ExecutorTaskData. Furthermore, given the simplicity of that class, you could serialize it manually, and then for each task just modify the first two ints & one long directly in the byte buffer. (You could do the same trick for serialization even if ExecutorTaskSetMeta were not a broadcast, but the broadcast keeps the messages small as well.)

There are a bunch of details I'm skipping here: you'd also need some special handling for the TaskMetrics; the way tasks get started in the executor would change; you'd also need to refactor {{Task}} to let it be reconstructed from this information (or add more to ExecutorTaskSetMeta); and probably other details I'm overlooking now.
But if we really see task serialization as an issue, this seems like the right approach. > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (that each had 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). > I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. 
> cc [~witgo] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19106: Assignee: Apache Spark > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Trivial > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18836) Serialize Task Metrics once per stage
[ https://issues.apache.org/jira/browse/SPARK-18836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805042#comment-15805042 ] Apache Spark commented on SPARK-18836: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/16489 > Serialize Task Metrics once per stage > - > > Key: SPARK-18836 > URL: https://issues.apache.org/jira/browse/SPARK-18836 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 2.2.0 > > > Right now we serialize the empty task metrics once per task -- Since this is > shared across all tasks we could use the same serialized task metrics across > all tasks of a stage
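The improvement described above amounts to hoisting one serialization out of the per-task loop: the empty task metrics are identical for every task of a stage, so serialize them once and share the bytes. A minimal Python sketch, with hypothetical names (`EmptyTaskMetrics`, `build_task_payloads`) standing in for Spark's Scala classes:

```python
import pickle
from dataclasses import dataclass


# The empty metrics object is identical for every task of a stage.
@dataclass
class EmptyTaskMetrics:
    stage_id: int


def build_task_payloads(stage_id: int, num_tasks: int) -> list:
    shared = pickle.dumps(EmptyTaskMetrics(stage_id))  # serialized once per stage
    # Every task reuses the same bytes object; no per-task serialization.
    return [shared for _ in range(num_tasks)]
```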
[jira] [Commented] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805043#comment-15805043 ] Apache Spark commented on SPARK-19106: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/16490 > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Priority: Trivial > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html
[jira] [Assigned] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19106: Assignee: (was: Apache Spark) > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Priority: Trivial > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html
[jira] [Commented] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805037#comment-15805037 ] Sean Owen commented on SPARK-19106: --- Yeah, the section headings aren't rendering as section titles. Not a big deal but should be fixed. PR coming. > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Priority: Trivial > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html
[jira] [Updated] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-17931: - Assignee: Kay Ousterhout > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Guoqiang Li >Assignee: Kay Ousterhout > Fix For: 2.2.0 > > > In the existing code, there are three layers of serialization > involved in sending a task from the scheduler to an executor: > - A Task object is serialized > - The Task object is copied to a byte buffer that also > contains serialized information about any additional JARs, > files, and Properties needed for the task to execute. This > byte buffer is stored as the member variable serializedTask > in the TaskDescription class. > - The TaskDescription is serialized (in addition to the serialized > task + JARs, the TaskDescription class contains the task ID and > other metadata) and sent in a LaunchTask message. > While it is necessary to have two layers of serialization, so that > the JAR, file, and Property info can be deserialized prior to > deserializing the Task object, the third layer of deserialization is > unnecessary (this is as a result of SPARK-2521). We should > eliminate a layer of serialization by moving the JARs, files, and Properties > into the TaskDescription class.
[jira] [Resolved] (SPARK-17931) taskScheduler has some unneeded serialization
[ https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-17931. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16053 [https://github.com/apache/spark/pull/16053] > taskScheduler has some unneeded serialization > - > > Key: SPARK-17931 > URL: https://issues.apache.org/jira/browse/SPARK-17931 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Guoqiang Li > Fix For: 2.2.0 > > > In the existing code, there are three layers of serialization > involved in sending a task from the scheduler to an executor: > - A Task object is serialized > - The Task object is copied to a byte buffer that also > contains serialized information about any additional JARs, > files, and Properties needed for the task to execute. This > byte buffer is stored as the member variable serializedTask > in the TaskDescription class. > - The TaskDescription is serialized (in addition to the serialized > task + JARs, the TaskDescription class contains the task ID and > other metadata) and sent in a LaunchTask message. > While it is necessary to have two layers of serialization, so that > the JAR, file, and Property info can be deserialized prior to > deserializing the Task object, the third layer of deserialization is > unnecessary (this is as a result of SPARK-2521). We should > eliminate a layer of serialization by moving the JARs, files, and Properties > into the TaskDescription class.
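The two layers the description above keeps — an envelope whose JAR/file/Property metadata can be read before the opaque serialized Task is touched — can be sketched as explicit framing. Python and the exact field layout here are assumptions for illustration; this is not Spark's actual TaskDescription wire format.

```python
import struct
from typing import List, Tuple


def encode(task_id: int, jars: List[str], task_bytes: bytes) -> bytes:
    """Frame the metadata first, the opaque task payload last."""
    out = struct.pack(">q", task_id)          # task id
    out += struct.pack(">i", len(jars))       # jar count
    for jar in jars:
        raw = jar.encode("utf-8")
        out += struct.pack(">i", len(raw)) + raw
    return out + struct.pack(">i", len(task_bytes)) + task_bytes


def decode_metadata(buf: bytes) -> Tuple[int, List[str]]:
    """Read the task id and jar list without deserializing the task itself."""
    (task_id,) = struct.unpack_from(">q", buf, 0)
    offset = 8
    (n_jars,) = struct.unpack_from(">i", buf, offset)
    offset += 4
    jars = []
    for _ in range(n_jars):
        (n,) = struct.unpack_from(">i", buf, offset)
        offset += 4
        jars.append(buf[offset:offset + n].decode("utf-8"))
        offset += n
    # The serialized Task is deliberately left unread here: jars, files,
    # and properties can be set up before the Task object is deserialized.
    return task_id, jars
```

This framing is exactly why the second layer is needed while the third was not: the metadata is readable up front, and there is no outer object to deserialize first.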
[jira] [Comment Edited] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804899#comment-15804899 ] Yanbo Liang edited comment on SPARK-10078 at 1/6/17 4:27 PM: - [~debasish83] We aim to implement VL-BFGS as an optimizer for ~billion features as a peer to Breeze LBFGS/OWLQN, so that ML algorithms can switch between them automatically based on the number of features. So an abstract interface between the algorithms and optimizers is absolutely necessary. For VL-BFGS, I have a basic implementation at https://github.com/yanboliang/spark-vlbfgs; please feel free to review and comment on the code. Thanks. was (Author: yanboliang): [~debasish83] We are aim to implement VL-BFGS as an optimizer which should be similar with Breeze LBFGS/OWLQN, and switching between them should be automatically based on the number of features. So an abstract interface between the algorithm and optimizer is really necessary. I have a basic implementation at https://github.com/yanboliang/spark-vlbfgs, please feel free to review and comment the code. Thanks. > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804899#comment-15804899 ] Yanbo Liang commented on SPARK-10078: - [~debasish83] We aim to implement VL-BFGS as an optimizer similar to Breeze LBFGS/OWLQN, with switching between them handled automatically based on the number of features. So an abstract interface between the algorithm and optimizer is really necessary. I have a basic implementation at https://github.com/yanboliang/spark-vlbfgs; please feel free to review and comment on the code. Thanks. > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
[jira] [Resolved] (SPARK-19033) HistoryServer still uses old ACLs even if ACLs are updated
[ https://issues.apache.org/jira/browse/SPARK-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-19033. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > HistoryServer still uses old ACLs even if ACLs are updated > -- > > Key: SPARK-19033 > URL: https://issues.apache.org/jira/browse/SPARK-19033 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.1.1, 2.2.0 > > > In the current implementation of HistoryServer, application ACLs are picked > from the event log rather than from configuration: > {code} > val uiAclsEnabled = > conf.getBoolean("spark.history.ui.acls.enable", false) > ui.getSecurityManager.setAcls(uiAclsEnabled) > // make sure to set admin acls before view acls so they are > properly picked up > > ui.getSecurityManager.setAdminAcls(appListener.adminAcls.getOrElse("")) > ui.getSecurityManager.setViewAcls(attempt.sparkUser, > appListener.viewAcls.getOrElse("")) > > ui.getSecurityManager.setAdminAclsGroups(appListener.adminAclsGroups.getOrElse("")) > > ui.getSecurityManager.setViewAclsGroups(appListener.viewAclsGroups.getOrElse("")) > {code} > This becomes a problem when the ACLs are updated (e.g., a newly added admin): only > new applications are affected, while old applications still use the > old ACLs. So the new admins still cannot check the logs of old applications. > It is hard to say this is a bug, but in our scenario this is not the > behavior we expected.
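The behavior the reporter wants can be expressed as a union of the two ACL sources, so that admins added after an application finished can still view it. A hedged Python sketch; the function name, the comma-separated ACL format, and the merge-rather-than-replace policy are assumptions for illustration, not the actual patch.

```python
from typing import Optional


def effective_acls(conf_acls: Optional[str], event_log_acls: Optional[str]) -> set:
    """Union of comma-separated ACL lists from config and from the event log."""
    def split(raw: Optional[str]) -> set:
        return {u.strip() for u in (raw or "").split(",") if u.strip()}
    # Taking the union means a newly configured admin is honored even for
    # applications whose event logs predate the configuration change.
    return split(conf_acls) | split(event_log_acls)
```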
[jira] [Updated] (SPARK-19033) HistoryServer still uses old ACLs even if ACLs are updated
[ https://issues.apache.org/jira/browse/SPARK-19033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-19033: -- Assignee: Saisai Shao > HistoryServer still uses old ACLs even if ACLs are updated > -- > > Key: SPARK-19033 > URL: https://issues.apache.org/jira/browse/SPARK-19033 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > > In the current implementation of HistoryServer, application ACLs are picked > from the event log rather than from configuration: > {code} > val uiAclsEnabled = > conf.getBoolean("spark.history.ui.acls.enable", false) > ui.getSecurityManager.setAcls(uiAclsEnabled) > // make sure to set admin acls before view acls so they are > properly picked up > > ui.getSecurityManager.setAdminAcls(appListener.adminAcls.getOrElse("")) > ui.getSecurityManager.setViewAcls(attempt.sparkUser, > appListener.viewAcls.getOrElse("")) > > ui.getSecurityManager.setAdminAclsGroups(appListener.adminAclsGroups.getOrElse("")) > > ui.getSecurityManager.setViewAclsGroups(appListener.viewAclsGroups.getOrElse("")) > {code} > This becomes a problem when the ACLs are updated (e.g., a newly added admin): only > new applications are affected, while old applications still use the > old ACLs. So the new admins still cannot check the logs of old applications. > It is hard to say this is a bug, but in our scenario this is not the > behavior we expected.
[jira] [Commented] (SPARK-10078) Vector-free L-BFGS
[ https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804882#comment-15804882 ] Yanbo Liang commented on SPARK-10078: - [~sethah] The description is a little misleading; it means the VL-BFGS implementation can fit the current API. Feature partitioning (VL-BFGS) or not (Breeze LBFGS) will be chosen automatically depending on the number of features. The purpose of VL-BFGS is not to replace Breeze LBFGS, but to complement it. Thanks. > Vector-free L-BFGS > -- > > Key: SPARK-10078 > URL: https://issues.apache.org/jira/browse/SPARK-10078 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > This is to implement a scalable version of vector-free L-BFGS > (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf). > Design document: > https://docs.google.com/document/d/1VGKxhg-D-6-vZGUAZ93l3ze2f3LBvTjfHRFVpX68kaw/edit?usp=sharing
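The automatic choice described in this thread — feature-partitioned VL-BFGS only past some dimensionality, single-node L-BFGS otherwise — reduces to a simple dispatch. The threshold value and the names below are hypothetical illustrations, not the spark-vlbfgs code:

```python
# Hypothetical cutoff; the real heuristic would be tuned empirically.
VLBFGS_FEATURE_THRESHOLD = 100_000_000


def choose_optimizer(num_features: int) -> str:
    """Pick feature-partitioned VL-BFGS only for very high-dimensional problems."""
    return "vl-bfgs" if num_features > VLBFGS_FEATURE_THRESHOLD else "breeze-lbfgs"
```

An abstract optimizer interface, as the comment argues, is what lets a dispatch like this sit between the ML algorithms and the two implementations.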
[jira] [Assigned] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog
[ https://issues.apache.org/jira/browse/SPARK-19107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19107: Assignee: Apache Spark (was: Wenchen Fan) > support creating hive table with DataFrameWriter and Catalog > > > Key: SPARK-19107 > URL: https://issues.apache.org/jira/browse/SPARK-19107 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Assigned] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog
[ https://issues.apache.org/jira/browse/SPARK-19107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19107: Assignee: Wenchen Fan (was: Apache Spark) > support creating hive table with DataFrameWriter and Catalog > > > Key: SPARK-19107 > URL: https://issues.apache.org/jira/browse/SPARK-19107 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Commented] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog
[ https://issues.apache.org/jira/browse/SPARK-19107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804809#comment-15804809 ] Apache Spark commented on SPARK-19107: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16487 > support creating hive table with DataFrameWriter and Catalog > > > Key: SPARK-19107 > URL: https://issues.apache.org/jira/browse/SPARK-19107 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-19107) support creating hive table with DataFrameWriter and Catalog
Wenchen Fan created SPARK-19107: --- Summary: support creating hive table with DataFrameWriter and Catalog Key: SPARK-19107 URL: https://issues.apache.org/jira/browse/SPARK-19107 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Created] (SPARK-19106) Styling for the configuration docs is broken
Nicholas Chammas created SPARK-19106: Summary: Styling for the configuration docs is broken Key: SPARK-19106 URL: https://issues.apache.org/jira/browse/SPARK-19106 Project: Spark Issue Type: Bug Components: Documentation Reporter: Nicholas Chammas Priority: Trivial Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png There are several styling problems with the configuration docs, starting roughly from the Scheduling section on down. http://spark.apache.org/docs/latest/configuration.html
[jira] [Updated] (SPARK-19106) Styling for the configuration docs is broken
[ https://issues.apache.org/jira/browse/SPARK-19106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-19106: - Attachment: Screen Shot 2017-01-06 at 10.20.52 AM.png > Styling for the configuration docs is broken > > > Key: SPARK-19106 > URL: https://issues.apache.org/jira/browse/SPARK-19106 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Nicholas Chammas >Priority: Trivial > Attachments: Screen Shot 2017-01-06 at 10.20.52 AM.png > > > There are several styling problems with the configuration docs, starting > roughly from the Scheduling section on down. > http://spark.apache.org/docs/latest/configuration.html
[jira] [Closed] (SPARK-19086) Improper scoping of name resolution of columns in HAVING clause
[ https://issues.apache.org/jira/browse/SPARK-19086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nattavut Sutyanyong closed SPARK-19086. --- Resolution: Not A Problem > Improper scoping of name resolution of columns in HAVING clause > --- > > Key: SPARK-19086 > URL: https://issues.apache.org/jira/browse/SPARK-19086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Priority: Minor > > There seems to be a problem on the scoping of name resolution of columns in a > HAVING clause. > Here is a scenario of the problem: > {code} > // A simplified version of TC 01.13 from PR-16337 > Seq((1,1,1)).toDF("t1a", "t1b", "t1c").createOrReplaceTempView("t1") > Seq((1,1,1)).toDF("t2a", "t2b", "t2c").createOrReplaceTempView("t2") > // This is okay. > // Error: t2c is unresolved > sql("select t2a from t2 group by t2a having t2c = 8").show > // This is okay as t2c is resolved to the t2 on the parent side > // because t2 in the subquery does not output column t2c. > sql("select * from t2 where t2a in (select t2a from (select t2a from t2) t2 > group by t2a having t2c = 8)").explain(true) > // This is the problem. > sql("select * from t2 where t2a in (select t2a from t2 group by t2a having > t2c = 8)").explain(true) > == Analyzed Logical Plan == > t2a: int, t2b: int, t2c: int > Project [t2a#22, t2b#23, t2c#24] > +- Filter predicate-subquery#38 [(t2a#22 = t2a#22#49) && (t2c#24 = 8)] >: +- Project [t2a#22 AS t2a#22#49] >: +- Aggregate [t2a#22], [t2a#22] >:+- SubqueryAlias t2, `t2` >: +- Project [_1#18 AS t2a#22, _2#19 AS t2b#23, _3#20 AS t2c#24] >: +- LocalRelation [_1#18, _2#19, _3#20] >+- SubqueryAlias t2, `t2` > +- Project [_1#18 AS t2a#22, _2#19 AS t2b#23, _3#20 AS t2c#24] > +- LocalRelation [_1#18, _2#19, _3#20] > {code} > We should not resolve {{t2c}} in the subquery to the outer {{t2}} on the > parent side. 
It should try to resolve {{t2c}} to the {{t2}} in the subquery > from its current scope and raise an exception because it is invalid to pull > up the column {{t2c}} from the {{Aggregate}} operator below.
[jira] [Updated] (SPARK-19099) Wrong time display on Spark History Server web UI
[ https://issues.apache.org/jira/browse/SPARK-19099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19099: -- Target Version/s: (was: 2.1.0) Labels: (was: none) Fix Version/s: (was: 2.1.1) > Wrong time display on Spark History Server web UI > - > > Key: SPARK-19099 > URL: https://issues.apache.org/jira/browse/SPARK-19099 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.1.0 >Reporter: JohnsonZhang >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > While using the Spark history server, I got a wrong job start time and end > time. I tracked down the reason and found it's caused by the hard-coded TimeZone > rawOffset. > I've changed it to acquire the offset value from the system.
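The fix the reporter sketches — taking the offset from the system's default time zone rather than a hard-coded raw offset — looks roughly like this. Shown in Python for illustration; the actual patch would presumably query `java.util.TimeZone` on the JVM side, which is an assumption here, not a quote from it.

```python
from datetime import datetime, timezone


def local_utc_offset_millis() -> int:
    """Current UTC offset of the system time zone, DST included."""
    offset = datetime.now(timezone.utc).astimezone().utcoffset()
    return int(offset.total_seconds() * 1000)
```

Because the value is derived at call time, it tracks daylight-saving transitions instead of freezing one raw offset into the code.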
[jira] [Commented] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804545#comment-15804545 ] Peter Parente commented on SPARK-19105: --- From chat in the PR, the keytab on AM does have the proper UUID suffix. The HDFS staging area filename is a red herring and a misunderstanding on my part about where the files are first distributed. > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
The other calls in that > object which use copyFileToRemote and have a custom destination name all > provide this parameter (e.g., > https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).
[jira] [Closed] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Parente closed SPARK-19105. - Resolution: Invalid > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
The other calls in that > object which use copyFileToRemote and have a custom destination name all > provide this parameter (e.g., > https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).
[jira] [Assigned] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19105: Assignee: (was: Apache Spark) > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
The other calls in that > object which use copyFileToRemote and have a custom destination name all > provide this parameter (e.g., > https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652).
[jira] [Issue Comment Deleted] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Parente updated SPARK-19105: -- Comment: was deleted (was: Related PR https://github.com/apache/spark/pull/16482) > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
[jira] [Commented] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804475#comment-15804475 ] Apache Spark commented on SPARK-19105: -- User 'parente' has created a pull request for this issue: https://github.com/apache/spark/pull/16482 > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
[jira] [Assigned] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19105: Assignee: Apache Spark > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente >Assignee: Apache Spark > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
[jira] [Commented] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804476#comment-15804476 ] Peter Parente commented on SPARK-19105: --- Related PR https://github.com/apache/spark/pull/16482 > yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to renew the tickets. > The problem looks to be in one call to [copyFileToRemote in > yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] > that leaves off the destination filename param. 
[jira] [Updated] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
[ https://issues.apache.org/jira/browse/SPARK-19105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Parente updated SPARK-19105: -- Description: When I specify --principal user@REALM and --keytab /some/path/user.keytab, I see the following in my app staging directory on HDFS: {code} -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 __spark_libs__4440821503780683972.zip -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip {code} I also see that my spark.yarn.keytab config value has changed to user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure the keytab is unique within the app staging directory. However, from the directory listing above, it's clear that the file written does not match this new name. As a result, when it comes time to renew the Kerberos ticket, AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed name and also fails to renew the tickets. The problem looks to be in one call to [copyFileToRemote in yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] that leaves off the destination filename param. The other calls in that object which use copyFileToRemote and have a custom destination name all provide this parameter (e.g., https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652). 
was: When I specify {{monospaced}}--principal user@REALM{{monospaced}} and {{monospaced}}--keytab /some/path/user.keytab{{monospaced}}, I see the following in my app staging directory on HDFS: {code} -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 __spark_libs__4440821503780683972.zip -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip {code} I also see that my spark.yarn.keytab config value has changed to user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure the keytab is unique within the app staging directory. However, from the directory listing above, it's clear that the file written does not match this new name. As a result, when it comes time to renew the Kerberos ticket, AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed name and also fails to renew the tickets. The problem looks to be in one call to [copyFileToRemote in yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] that leaves off the destination filename param. The other calls in that object which use copyFileToRemote and have a custom destination name all provide this parameter (e.g., https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652). 
> yarn/Client.scala copyToRemote does not include keytab destination name > --- > > Key: SPARK-19105 > URL: https://issues.apache.org/jira/browse/SPARK-19105 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 > Environment: YARN in client mode >Reporter: Peter Parente > > When I specify --principal user@REALM and --keytab /some/path/user.keytab, I > see the following in my app staging directory on HDFS: > {code} > -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab > -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip > -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 > __spark_libs__4440821503780683972.zip > -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip > -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip > {code} > I also see that my spark.yarn.keytab config value has changed to > user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure > the keytab is unique within the app staging directory. However, from the > directory listing above, it's clear that the file written does not match this > new name. As a result, when it comes time to renew the Kerberos ticket, > AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed > name and also fails to
[jira] [Created] (SPARK-19105) yarn/Client.scala copyToRemote does not include keytab destination name
Peter Parente created SPARK-19105: - Summary: yarn/Client.scala copyToRemote does not include keytab destination name Key: SPARK-19105 URL: https://issues.apache.org/jira/browse/SPARK-19105 Project: Spark Issue Type: Bug Affects Versions: 2.1.0 Environment: YARN in client mode Reporter: Peter Parente When I specify {{monospaced}}--principal user@REALM{{monospaced}} and {{monospaced}}--keytab /some/path/user.keytab{{monospaced}}, I see the following in my app staging directory on HDFS: {code} -rw-r--r-- 3 user supergroup 68 2017-01-06 03:59 user.keytab -rw-r--r-- 3 user supergroup 73502 2017-01-06 03:59 __spark_conf__.zip -rw-r--r-- 3 user supergroup 189767340 2017-01-06 03:59 __spark_libs__4440821503780683972.zip -rw-r--r-- 3 user supergroup 91275 2017-01-06 03:59 py4j-0.10.3-src.zip -rw-r--r-- 3 user supergroup 440385 2017-01-06 03:59 pyspark.zip {code} I also see that my spark.yarn.keytab config value has changed to user.keytab-54ee5192-43d0-41b5-ba50-1181ece26961 by the yarn client to ensure the keytab is unique within the app staging directory. However, from the directory listing above, it's clear that the file written does not match this new name. As a result, when it comes time to renew the Kerberos ticket, AMDelegationTokenRenewer fails to find the keytab under the UUID-suffixed name and also fails to renew the tickets. The problem looks to be in one call to [copyFileToRemote in yarn/Client.java|https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L482] that leaves off the destination filename param. The other calls in that object which use copyFileToRemote and have a custom destination name all provide this parameter (e.g., https://github.com/apache/spark/blob/fe1c895e16c475a6f271ce600a42a8d0dc7986e5/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L652). 
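The naming mismatch is easy to see in isolation. A standalone Python sketch (hypothetical file names; the real logic lives in yarn/Client.scala) of why the delegation-token renewer's lookup fails when the destination name is omitted:

```python
import uuid

def stage_keytab(staging, local_name, dest_name=None):
    """Mimics the copyFileToRemote contract: the file is stored under
    dest_name when one is given, otherwise under its local name."""
    staging[dest_name or local_name] = b"keytab-bytes"

staging_dir = {}
local = "user.keytab"
# The client rewrites spark.yarn.keytab to a UUID-suffixed name ...
renamed = local + "-" + str(uuid.uuid4())
# ... but the buggy call site omits the destination name, so the file
# lands in the staging dir under its original name.
stage_keytab(staging_dir, local)
assert renamed not in staging_dir   # the token renewer's lookup fails here
# Passing the destination name, as the other call sites already do, fixes it.
stage_keytab(staging_dir, local, dest_name=renamed)
assert renamed in staging_dir
```

The fix in the linked PR amounts to the second call shape: the UUID-suffixed name recorded in spark.yarn.keytab and the name the file is written under must come from the same value.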
[jira] [Commented] (SPARK-9215) Implement WAL-free Kinesis receiver that gives an at-least-once guarantee
[ https://issues.apache.org/jira/browse/SPARK-9215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804425#comment-15804425 ] Gaurav Shah commented on SPARK-9215: [~tdas] I know this is an old pull request but was still wondering if you can help. I was wondering can we enhance this to make sure that we checkpoint only after blocks of data has been written. So we need to implement Spark checkpoint in the first place. Each block has a start and end seq number. > Implement WAL-free Kinesis receiver that give at-least once guarantee > - > > Key: SPARK-9215 > URL: https://issues.apache.org/jira/browse/SPARK-9215 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > Fix For: 1.5.0 > > > Currently, the KinesisReceiver can loose some data in the case of certain > failures (receiver and driver failures). Using the write ahead logs can > mitigate some of the problem, but it is not ideal because WALs dont work with > S3 (eventually consistency, etc.) which is the most likely file system to be > used in the EC2 environment. Hence, we have to take a different approach to > improving reliability for Kinesis. > Detailed design doc - > https://docs.google.com/document/d/1k0dl270EnK7uExrsCE7jYw7PYx0YC935uBcxn3p0f58/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
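One way to realize the "checkpoint only after blocks of data have been written" idea from the comment above, sketched in Python rather than the actual KinesisReceiver code: track each block's start/end sequence numbers and only advance the Kinesis checkpoint to the end of the longest unbroken prefix of durably stored blocks.

```python
def safe_checkpoint(blocks):
    """blocks: list of (start_seq, end_seq, stored) tuples in arrival order.
    Returns the highest end_seq that is safe to checkpoint: the end of the
    last block in the unbroken prefix of stored blocks, or None if the very
    first block has not been stored yet."""
    last = None
    for start, end, stored in blocks:
        if not stored:
            break  # a gap: checkpointing past it could lose that block's data
        last = end
    return last

blocks = [(0, 9, True), (10, 19, True), (20, 29, False), (30, 39, True)]
assert safe_checkpoint(blocks) == 19  # stop before the unstored block
```

The key property is that a checkpoint never runs ahead of durable storage, so a receiver restart replays at most the unstored suffix, preserving at-least-once delivery.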
[jira] [Created] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0
Nils Grabbert created SPARK-19104: - Summary: CompileException with Map and Case Class in Spark 2.1.0 Key: SPARK-19104 URL: https://issues.apache.org/jira/browse/SPARK-19104 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Nils Grabbert The following code will run with Spark 2.0.2 but not with Spark 2.1.0: {code}
case class InnerData(name: String, value: Int)
case class Data(id: Int, param: Map[String, InnerData])

val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 100))))
val ds = spark.createDataset(data)
{code} Exception: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, Column 46: Expression "ExternalMapToCatalyst_value_isNull1" is not an rvalue at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639) at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) at org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) at 
org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) at org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) at org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:196) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:91) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:935) ... 77 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Commented] (SPARK-18997) Recommended upgrade libthrift to 0.9.3
[ https://issues.apache.org/jira/browse/SPARK-18997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804135#comment-15804135 ] Sean Owen commented on SPARK-18997: --- Help with what, opening a PR? http://spark.apache.org/contributing.html You can look at the libthrift changes from the project itself to assess what changed. IIRC it was a lot. > Recommended upgrade libthrift to 0.9.3 > --- > > Key: SPARK-18997 > URL: https://issues.apache.org/jira/browse/SPARK-18997 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: meiyoula >Priority: Minor > > libthrift 0.9.2 has a serious security vulnerability:CVE-2015-3254 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19097) virtualenv example failed with conda due to ImportError: No module named ruamel.yaml.comments
[ https://issues.apache.org/jira/browse/SPARK-19097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19097. --- Resolution: Duplicate I don't see value in opening a bunch of JIRAs on the same theme. These look like near duplicates and depend on this functionality being supported in the first place, which isn't apparently supported according to the parent. > virtualenv example failed with conda due to ImportError: No module named > ruamel.yaml.comments > - > > Key: SPARK-19097 > URL: https://issues.apache.org/jira/browse/SPARK-19097 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Yesha Vora > > Spark version : 2 > Steps: > * install conda on all nodes (python2.7) ( pip install conda ) > * create requirement1.txt with "numpy > requirement1.txt " > * Run kmeans.py application in yarn-client mode. > {code} > spark-submit --master yarn --deploy-mode client --conf > "spark.pyspark.virtualenv.enabled=true" --conf > "spark.pyspark.virtualenv.type=conda" --conf > "spark.pyspark.virtualenv.requirements=/tmp/requirements1.txt" --conf > "spark.pyspark.virtualenv.bin.path=/usr/bin/conda" --jars > /usr/hadoop-client/lib/hadoop-lzo.jar kmeans.py /tmp/in/kmeans_data.txt > 3{code} > {code:title=app log} > 17/01/06 01:39:25 DEBUG PythonWorkerFactory: user.home=/home/yarn > 17/01/06 01:39:25 DEBUG PythonWorkerFactory: Running command:/usr/bin/conda > create --prefix > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0017/container_1483592608863_0017_01_03/virtualenv_application_1483592608863_0017_0 > --file requirements1.txt -y > Traceback (most recent call last): > File "/usr/bin/conda", line 11, in > load_entry_point('conda==4.2.7', 'console_scripts', 'conda')() > File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line > 561, in load_entry_point > return get_distribution(dist).load_entry_point(group, name) > File 
"/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line > 2631, in load_entry_point > return ep.load() > File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line > 2291, in load > return self.resolve() > File "/usr/lib/python2.7/site-packages/pkg_resources/__init__.py", line > 2297, in resolve > module = __import__(self.module_name, fromlist=['__name__'], level=0) > File "/usr/lib/python2.7/site-packages/conda/cli/__init__.py", line 8, in > > from .main import main # NOQA > File "/usr/lib/python2.7/site-packages/conda/cli/main.py", line 46, in > > from ..base.context import context > File "/usr/lib/python2.7/site-packages/conda/base/context.py", line 18, in > > from ..common.configuration import (Configuration, MapParameter, > PrimitiveParameter, > File "/usr/lib/python2.7/site-packages/conda/common/configuration.py", line > 40, in > from ruamel.yaml.comments import CommentedSeq, CommentedMap # pragma: no > cover > ImportError: No module named ruamel.yaml.comments > 17/01/06 01:39:26 WARN BlockManager: Putting block rdd_3_0 failed due to an > exception > 17/01/06 01:39:26 WARN BlockManager: Block rdd_3_0 could not be removed as it > was not found on disk or in memory > 17/01/06 01:39:26 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.RuntimeException: Fail to run command: /usr/bin/conda create > --prefix > /grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0017/container_1483592608863_0017_01_03/virtualenv_application_1483592608863_0017_0 > --file requirements1.txt -y > at > org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142) > at > org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:124) > at > org.apache.spark.api.python.PythonWorkerFactory.(PythonWorkerFactory.scala:70) > at > org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117) > at > 
org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117) > at > scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194) > at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116) > at > org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
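The executor-side failure above is a shell-out that dies: PythonWorkerFactory runs the `conda create` command, conda itself crashes on the missing ruamel.yaml module, and the nonzero exit surfaces as `java.lang.RuntimeException: Fail to run command`. A rough Python sketch of that failure path (not the real Scala code; the underlying environment fix is presumably repairing the conda install, e.g. installing the missing ruamel.yaml dependency on every node):

```python
import subprocess

def exec_command(cmd):
    """Roughly what PythonWorkerFactory.execCommand does: run the
    virtualenv setup command and raise on a nonzero exit status."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise RuntimeError("Fail to run command: " + " ".join(cmd))
    return proc.stdout

# A command that exits nonzero stands in for the crashing `conda create`.
try:
    exec_command(["python3", "-c", "raise SystemExit(1)"])
    failed = False
except RuntimeError:
    failed = True
assert failed  # the executor sees exactly this failure and aborts the task
```

Because the environment is created lazily per executor, any node with a broken conda install fails only at task time, which is why the traceback appears in the application log rather than at submit time.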
[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results
[ https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaodongCui updated SPARK-19102: Description: the problem is cube6's second column named sumprice is 1 times bigger than the cube5's second column named sumprice,but they should be equal .the bug is only reappear in the format like sum(a * b),count (distinct c) code: DataFrame df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a"); df1.registerTempTable("hd_salesflat"); DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1"); DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1"); cube5.show(50); cube6.show(50); my data: transno | quantity | unitprice | areacode1 76317828| 1. | 25. | HDCN data schema: |-- areacode1: string (nullable = true) |-- quantity: decimal(20,4) (nullable = true) |-- unitprice: decimal(20,4) (nullable = true) |-- transno: string (nullable = true) was: the problem is cube6's second column named sumprice is 1 times bigger than the cube5's second column named sumprice,but they should be equal .the bug is only reappear in the format like sum(a * b),count (distinct c) DataFrame df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a"); df1.registerTempTable("hd_salesflat"); DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1"); DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1"); cube5.show(50); cube6.show(50); my data: transno | quantity | unitprice | areacode1 76317828| 1. | 25. 
| HDCN data schema: |-- areacode1: string (nullable = true) |-- quantity: decimal(20,4) (nullable = true) |-- unitprice: decimal(20,4) (nullable = true) |-- transno: string (nullable = true) > Accuracy error of spark SQL results > --- > > Key: SPARK-19102 > URL: https://issues.apache.org/jira/browse/SPARK-19102 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.0, 1.6.1 > Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6 >Reporter: XiaodongCui > Attachments: a.zip > > > the problem is cube6's second column named sumprice is 1 times bigger > than the cube5's second column named sumprice,but they should be equal .the > bug is only reappear in the format like sum(a * b),count (distinct c) > code: > > DataFrame > df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a"); > df1.registerTempTable("hd_salesflat"); > DataFrame cube5 = sqlContext.sql("SELECT areacode1, > SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1"); > DataFrame cube6 = sqlContext.sql("SELECT areacode1, > SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM > hd_salesflat GROUP BY areacode1"); > cube5.show(50); > cube6.show(50); > > my data: > transno | quantity | unitprice | areacode1 > 76317828| 1. | 25. | HDCN > data schema: > |-- areacode1: string (nullable = true) > |-- quantity: decimal(20,4) (nullable = true) > |-- unitprice: decimal(20,4) (nullable = true) > |-- transno: string (nullable = true) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-19102) Accuracy error of spark SQL results
[ https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaodongCui reopened SPARK-19102: - the data under the path :hdfs://cdh01:8020/sandboxdata_A/test/a in the attach file > Accuracy error of spark SQL results > --- > > Key: SPARK-19102 > URL: https://issues.apache.org/jira/browse/SPARK-19102 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.6.0, 1.6.1 > Environment: Spark 1.6.0, Hadoop 2.6.0,JDK 1.8,CentOS6.6 >Reporter: XiaodongCui > Attachments: a.zip > > > the problem is cube6's second column named sumprice is 1 times bigger > than the cube5's second column named sumprice,but they should be equal .the > bug is only reappear in the format like sum(a * b),count (distinct c) > DataFrame > df1=sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a"); > df1.registerTempTable("hd_salesflat"); > DataFrame cube5 = sqlContext.sql("SELECT areacode1, > SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1"); > DataFrame cube6 = sqlContext.sql("SELECT areacode1, > SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM > hd_salesflat GROUP BY areacode1"); > cube5.show(50); > cube6.show(50); > my data: > transno | quantity | unitprice | areacode1 > 76317828| 1. | 25. | HDCN > data schema: > |-- areacode1: string (nullable = true) > |-- quantity: decimal(20,4) (nullable = true) > |-- unitprice: decimal(20,4) (nullable = true) > |-- transno: string (nullable = true) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
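The invariant the reporter expects — adding COUNT(DISTINCT transno) must not change SUM(quantity * unitprice) for the same grouping — can be stated outside Spark. Here is a standalone SQLite rendition of the two queries over hypothetical sample rows; on the affected Spark versions the reporter saw cube6's sum come back doubled, whereas the equality below is what should hold:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hd_salesflat "
             "(transno TEXT, quantity REAL, unitprice REAL, areacode1 TEXT)")
# Hypothetical rows standing in for the reporter's parquet data.
rows = [("76317828", 1.0, 25.0, "HDCN"), ("76317829", 2.0, 10.0, "HDCN")]
conn.executemany("INSERT INTO hd_salesflat VALUES (?, ?, ?, ?)", rows)

# cube5: plain SUM(a*b); cube6: same SUM plus a COUNT(DISTINCT c).
cube5 = conn.execute("SELECT areacode1, SUM(quantity*unitprice) "
                     "FROM hd_salesflat GROUP BY areacode1").fetchall()
cube6 = conn.execute("SELECT areacode1, SUM(quantity*unitprice), "
                     "COUNT(DISTINCT transno) "
                     "FROM hd_salesflat GROUP BY areacode1").fetchall()

# The sums must agree regardless of the extra distinct aggregate.
assert cube5[0][1] == cube6[0][1] == 45.0
```

The bug pattern matters because Spark 1.6 rewrites a query mixing a regular aggregate with a distinct aggregate into a two-phase expand/aggregate plan, which is where a sum can be double-counted.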
[jira] [Commented] (SPARK-19101) Spark Beeline catches an exception when running command "load data inpath '/data/test/test.csv' overwrite into table db.test partition(area='021')"
[ https://issues.apache.org/jira/browse/SPARK-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804113#comment-15804113 ] Sean Owen commented on SPARK-19101:

This is more of a Hive error, and suggests something else went wrong earlier ("filesystem closed"). By itself I don't think this is actionable.

> Spark Beeline catches an exception when running the command "load data inpath '/data/test/test.csv' overwrite into table db.test partition(area='021')"
> -----------------------------------------------------------------------
>
> Key: SPARK-19101
> URL: https://issues.apache.org/jira/browse/SPARK-19101
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.1
> Environment: spark2.0.1
> Reporter: Xiaochen Ouyang
>
> Firstly, consider the two commands below:
> 1: load data inpath '/data/test/lte_cm_projdata_52.csv' overwrite into table db.lte_cm_projdata partition(p_provincecode=52);
> 2: load data local inpath '/home/mr/lte_cm_projdata_52.csv' overwrite into table db.lte_cm_projdata partition(p_provincecode=52);
> The first command fails, but the second succeeds.
> beeline exception:
> 0: jdbc:hive2://10.43.156.221:18000> load data inpath '/data/test/lte_cm_projdata_52.csv' overwrite into table db.lte_cm_projdata partition(p_provincecode=52);
> Error: java.lang.reflect.InvocationTargetException (state=,code=0)
>
> ThriftServer2 logs:
> 2017-01-06 15:16:16,518 INFO HiveMetaStore: 58: get_partition_with_auth : db=zxvmax tbl=lte_cm_projdata[52]
> 2017-01-06 15:16:16,518 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partition_with_auth : db=zxvmax tbl=lte_cm_projdata[52]
> 2017-01-06 15:16:16,611 ERROR SparkExecuteStatementOperation: Error executing query, currentState RUNNING,
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.sql.hive.client.Shim_v0_14.loadPartition(HiveShim.scala:622)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply$mcV$sp(HiveClientImpl.scala:635)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply(HiveClientImpl.scala:635)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadPartition$1.apply(HiveClientImpl.scala:635)
> at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
> at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
> at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
> at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269)
> at org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:634)
> at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply$mcV$sp(HiveExternalCatalog.scala:279)
> at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply(HiveExternalCatalog.scala:271)
> at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadPartition$1.apply(HiveExternalCatalog.scala:271)
> at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
> at org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:271)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadPartition(SessionCatalog.scala:317)
> at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:325)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results
[ https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaodongCui updated SPARK-19102:

Description:
The problem is that cube6's second column, named sumprice, is double ("1 times bigger" than) cube5's sumprice column, but they should be equal. The bug only reappears for queries of the form sum(a * b), count(distinct c).

DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
df1.registerTempTable("hd_salesflat");
DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1");
cube5.show(50);
cube6.show(50);

my data:
transno  | quantity | unitprice | areacode1
76317828 | 1.       | 25.       | HDCN

data schema:
|-- areacode1: string (nullable = true)
|-- quantity: decimal(20,4) (nullable = true)
|-- unitprice: decimal(20,4) (nullable = true)
|-- transno: string (nullable = true)

was:
The problem is that, in the results of the code below, the second column's values are not the same; the second SQL result is double ("1 times bigger" than) the first. The bug only reappears for queries of the form sum(a * b), count(distinct c).

DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
df1.registerTempTable("hd_salesflat");
DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1");
cube5.show(50);
cube6.show(50);

my data:
transno  | quantity | unitprice | areacode1
76317828 | 1.       | 25.       | HDCN

data schema:
|-- areacode1: string (nullable = true)
|-- quantity: decimal(20,4) (nullable = true)
|-- unitprice: decimal(20,4) (nullable = true)
|-- transno: string (nullable = true)

> Accuracy error of spark SQL results
> -----------------------------------
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0, JDK 1.8, CentOS 6.6
> Reporter: XiaodongCui
> Attachments: a.zip
>
> The problem is that cube6's second column, named sumprice, is double ("1 times bigger" than) cube5's sumprice column, but they should be equal. The bug only reappears for queries of the form sum(a * b), count(distinct c).
>
> DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
> df1.registerTempTable("hd_salesflat");
> DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
> DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1");
> cube5.show(50);
> cube6.show(50);
>
> my data:
> transno  | quantity | unitprice | areacode1
> 76317828 | 1.       | 25.       | HDCN
>
> data schema:
> |-- areacode1: string (nullable = true)
> |-- quantity: decimal(20,4) (nullable = true)
> |-- unitprice: decimal(20,4) (nullable = true)
> |-- transno: string (nullable = true)
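As a cross-check, the expected value of SUM(quantity*unitprice) can be computed outside Spark with exact decimal arithmetic, since both columns are decimal(20,4). Below is a minimal Python sketch; the row values are illustrative stand-ins for the truncated sample row (the real data is in the attached a.zip):

```python
from decimal import Decimal
from collections import defaultdict

# Illustrative stand-ins for the sample row; the real data is in a.zip.
rows = [
    {"transno": "76317828", "quantity": Decimal("1.0000"),
     "unitprice": Decimal("25.0000"), "areacode1": "HDCN"},
]

# SUM(quantity * unitprice) GROUP BY areacode1, computed exactly.
# Both cube5 and cube6 should report this value; adding
# COUNT(DISTINCT transno) must not change the sum.
sumprice = defaultdict(Decimal)
for r in rows:
    sumprice[r["areacode1"]] += r["quantity"] * r["unitprice"]

print(sumprice["HDCN"])  # 25.00000000
```

If cube6 returned double this reference value while cube5 matched it, the discrepancy would point at Spark's aggregation path for mixed sum/distinct queries rather than at the data itself.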
[jira] [Updated] (SPARK-19102) Accuracy error of spark SQL results
[ https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiaodongCui updated SPARK-19102:

Attachment: a.zip

The attached file is my data; the data is in Parquet format.

> Accuracy error of spark SQL results
> -----------------------------------
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0, JDK 1.8, CentOS 6.6
> Reporter: XiaodongCui
> Attachments: a.zip
>
> The problem is that, in the results of the code below, the second column's values are not the same; the second SQL result is double ("1 times bigger" than) the first. The bug only reappears for queries of the form sum(a * b), count(distinct c).
>
> DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
> df1.registerTempTable("hd_salesflat");
> DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
> DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1");
> cube5.show(50);
> cube6.show(50);
>
> my data:
> transno  | quantity | unitprice | areacode1
> 76317828 | 1.       | 25.       | HDCN
>
> data schema:
> |-- areacode1: string (nullable = true)
> |-- quantity: decimal(20,4) (nullable = true)
> |-- unitprice: decimal(20,4) (nullable = true)
> |-- transno: string (nullable = true)
[jira] [Resolved] (SPARK-19102) Accuracy error of spark SQL results
[ https://issues.apache.org/jira/browse/SPARK-19102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19102.

Resolution: Invalid

This doesn't describe a problem clearly. There's no data, no specifics about 'accuracy', and some unclear reference to something being 1 times bigger.

> Accuracy error of spark SQL results
> -----------------------------------
>
> Key: SPARK-19102
> URL: https://issues.apache.org/jira/browse/SPARK-19102
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 1.6.0, 1.6.1
> Environment: Spark 1.6.0, Hadoop 2.6.0, JDK 1.8, CentOS 6.6
> Reporter: XiaodongCui
>
> The problem is that, in the results of the code below, the second column's values are not the same; the second SQL result is double ("1 times bigger" than) the first. The bug only reappears for queries of the form sum(a * b), count(distinct c).
>
> DataFrame df1 = sqlContext.read().parquet("hdfs://cdh01:8020/sandboxdata_A/test/a");
> df1.registerTempTable("hd_salesflat");
> DataFrame cube5 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice FROM hd_salesflat GROUP BY areacode1");
> DataFrame cube6 = sqlContext.sql("SELECT areacode1, SUM(quantity*unitprice) AS sumprice, COUNT(DISTINCT transno) FROM hd_salesflat GROUP BY areacode1");
> cube5.show(50);
> cube6.show(50);
>
> my data:
> transno  | quantity | unitprice | areacode1
> 76317828 | 1.       | 25.       | HDCN
>
> data schema:
> |-- areacode1: string (nullable = true)
> |-- quantity: decimal(20,4) (nullable = true)
> |-- unitprice: decimal(20,4) (nullable = true)
> |-- transno: string (nullable = true)
[jira] [Resolved] (SPARK-19095) virtualenv example does not work in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-19095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19095.

Resolution: Duplicate

If this isn't something supported yet, it's not a bug, and I'd resolve this as a duplicate of the parent.

> virtualenv example does not work in yarn cluster mode
> -----------------------------------------------------
>
> Key: SPARK-19095
> URL: https://issues.apache.org/jira/browse/SPARK-19095
> Project: Spark
> Issue Type: Sub-task
> Reporter: Yesha Vora
> Priority: Critical
>
> Spark version: 2
> Steps:
> * install virtualenv on all nodes
> * create requirement1.txt with "numpy > requirement1.txt "
> * Run kmeans.py application in yarn-cluster mode.
> {code}
> spark-submit --master yarn --deploy-mode cluster --conf "spark.pyspark.virtualenv.enabled=true" --conf "spark.pyspark.virtualenv.type=native" --conf "spark.pyspark.virtualenv.requirements=/tmp/requirements1.txt" --conf "spark.pyspark.virtualenv.bin.path=/usr/bin/virtualenv" --jars /usr/hdp/current/hadoop-client/lib/hadoop-lzo.jar kmeans.py /tmp/in/kmeans_data.txt 3
> {code}
> The application fails to find numpy.
> {code}
> LogType:stdout
> Log Upload Time:Thu Jan 05 20:05:49 + 2017
> LogLength:134
> Log Contents:
> Traceback (most recent call last):
> File "kmeans.py", line 27, in
> import numpy as np
> ImportError: No module named numpy
> End of LogType:stdout
> {code}
[jira] [Resolved] (SPARK-19096) Kmeans.py application fails with virtualenv due to a parse error
[ https://issues.apache.org/jira/browse/SPARK-19096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19096. --- Resolution: Duplicate > Kmeans.py application fails with virtualenv and due to parse error > > > Key: SPARK-19096 > URL: https://issues.apache.org/jira/browse/SPARK-19096 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Yesha Vora > > Spark version : 2 > Steps: > * Install virtualenv ( pip install virtualenv) > * create requirements.txt (pip freeze > /tmp/requirements.txt) > * start kmeans.py application in yarn-client mode. > The application fails with Runtime Exception > {code:title=app log} > 17/01/05 19:49:59 INFO deprecation: mapred.task.partition is deprecated. > Instead, use mapreduce.task.partition > 17/01/05 19:49:59 INFO deprecation: mapred.job.id is deprecated. Instead, use > mapreduce.job.id > Invalid requirement: 'pip freeze' > Traceback (most recent call last): > File > "/grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0006/container_1483592608863_0006_01_02/virtualenv_application_1483592608863_0006_0/lib/python2.7/site-packages/pip/req/req_install.py", > line 82, in __init__ > req = Requirement(req) > File > "/grid/0/hadoop/yarn/local/usercache/hrt_qa/appcache/application_1483592608863_0006/container_1483592608863_0006_01_02/virtualenv_application_1483592608863_0006_0/lib/python2.7/site-packages/pip/_vendor/packaging/requirements.py", > line 96, in __init__ > requirement_string[e.loc:e.loc + 8])) > InvalidRequirement: Invalid requirement, parse error at "u'freeze'" > 17/01/05 19:50:03 WARN BlockManager: Putting block rdd_3_0 failed due to an > exception > 17/01/05 19:50:03 WARN BlockManager: Block rdd_3_0 could not be removed as it > was not found on disk or in memory > 17/01/05 19:50:03 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > {code} > {code:title=job client log} > 17/01/05 19:50:07 WARN TaskSetManager: Lost task 0.1 in 
stage 0.0 (TID 2, > xxx.site, executor 1): java.lang.RuntimeException: Fail to run command: > virtualenv_application_1483592608863_0006_1/bin/python -m pip --cache-dir > /home/yarn install -r requirements.txt > at > org.apache.spark.api.python.PythonWorkerFactory.execCommand(PythonWorkerFactory.scala:142) > at > org.apache.spark.api.python.PythonWorkerFactory.setupVirtualEnv(PythonWorkerFactory.scala:128) > at > org.apache.spark.api.python.PythonWorkerFactory.(PythonWorkerFactory.scala:70) > at > org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117) > at > org.apache.spark.SparkEnv$$anonfun$createPythonWorker$1.apply(SparkEnv.scala:117) > at > scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194) > at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > at > 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at
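The `Invalid requirement, parse error at "u'freeze'"` failure above arises because the requirements file handed to pip contains the literal text `pip freeze` (the shell command) rather than that command's output. The check pip performs can be sketched stand-alone; the regex below is a hypothetical, much-simplified subset of what pip's real PEP 508 parser accepts:

```python
import re

# Very rough sketch of a requirement-line sanity check (a tiny subset of
# PEP 508 -- pip's real parser is far stricter). The point: a file that
# literally contains the shell command "pip freeze", instead of that
# command's OUTPUT, fails to parse, matching the error in the log above.
REQ_RE = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*"          # distribution name
    r"\s*(\[[^\]]*\])?"                      # optional extras, e.g. [security]
    r"\s*([<>=!~]=?[^,]+(,\s*[<>=!~]=?[^,]+)*)?"  # optional version specifiers
    r"\s*$"
)

def looks_like_requirement(line: str) -> bool:
    """Return True if the line plausibly parses as a requirement specifier."""
    line = line.strip()
    if not line or line.startswith("#"):
        return True  # blank lines and comments are allowed in requirements files
    return REQ_RE.match(line) is not None

print(looks_like_requirement("numpy==1.11.0"))  # True
print(looks_like_requirement("pip freeze"))     # False: space inside the name
```

Generating the file with an actual `pip freeze > /tmp/requirements.txt` (so it contains lines like `numpy==1.11.0`) should avoid this particular failure.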
[jira] [Updated] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
[ https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19098:

Priority: Minor (was: Critical)

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> -------------------------------------------------------------------------
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
> Reporter: Steven Ruppert
> Priority: Minor
> Attachments: doubling-season.png
>
> I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla ConnectedComponents use, notably one that works fine with identical code on spark 2.0.1, but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have access to the original logs, so this initial report will be a little vague. However, this behavior as described might ring a bell to somebody.
> Roughly:
> {noformat}
> val edges: RDD[Edge[Int]] = _ // from file
> val vertices: RDD[(VertexId, Int)] = _ // from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10)
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a strange doubling of shuffle traffic in each round of Pregel (inside ConnectedComponents), increasing from the actual data size of ~50 GB, to 100 GB, to 200 GB, all the way to around 40 TB before I killed the job. The data being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the kryo-serialized shuffled data doubled in size.
> The heap usage of the executors themselves remained stable, or at least did not account 1-to-1 for the 40 TB of shuffled data, for I definitely do not have 40 TB of RAM. Furthermore, I also have kryo reference tracking turned on still, so whatever is leaking somehow gets around that.
> I'll update this ticket once I have more details, unless somebody else with the same problem reports back first.
[jira] [Resolved] (SPARK-19103) In web UI, URL's host name should be a specific IP address.
[ https://issues.apache.org/jira/browse/SPARK-19103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19103.

Resolution: Invalid

I can't understand what this means, but it's not true that the UI should use only IP addresses.

> In web UI, URL's host name should be a specific IP address.
> -----------------------------------------------------------
>
> Key: SPARK-19103
> URL: https://issues.apache.org/jira/browse/SPARK-19103
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Environment: spark 2.0.2
> Reporter: guoxiaolong
> Priority: Minor
> Attachments: 1.png, 2.png
>
> In the web UI, the URL's host name should be a specific IP address, because opening the URL requires resolving the host name; the host name cannot be resolved, so the URL cannot be opened. Please see the attachment. Thank you!
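The workaround mentioned in the comment thread is to map the unresolvable UI host name to its IP in the client machine's hosts file. A trivial sketch of the entry to add; the host name and IP below are made-up examples, not values from the report:

```python
# Sketch of the hosts-file workaround from the comment thread: map the
# web UI's host name to its IP on the client machine. The host name and
# IP here are made-up examples, not values from the report.
def hosts_entry(ip: str, hostname: str) -> str:
    """Format one hosts-file line ("IP<TAB>hostname")."""
    return f"{ip}\t{hostname}"

# Append the returned line to /etc/hosts (Linux/macOS) or
# C:\Windows\System32\drivers\etc\hosts (Windows), then retry the UI URL.
print(hosts_entry("192.168.1.10", "spark-master"))
```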