[jira] [Issue Comment Deleted] (SPARK-20785) Spark should provide jump links and add (count) in the SQL web ui.
[ https://issues.apache.org/jira/browse/SPARK-20785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-20785: --- Comment: was deleted (was: [~srowen] Help deal with the issue) > Spark should provide jump links and add (count) in the SQL web ui. > --- > > Key: SPARK-20785 > URL: https://issues.apache.org/jira/browse/SPARK-20785 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI > Affects Versions: 2.3.0 > Reporter: guoxiaolongzte > Priority: Minor > > It provides links that jump to Running Queries, Completed Queries and Failed > Queries. > It adds a (count) to Running Queries, Completed Queries and Failed Queries. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20785) Spark should provide jump links and add (count) in the SQL web ui.
[ https://issues.apache.org/jira/browse/SPARK-20785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016989#comment-16016989 ] guoxiaolongzte commented on SPARK-20785: [~srowen] Help deal with the issue > Spark should provide jump links and add (count) in the SQL web ui. > --- > > Key: SPARK-20785 > URL: https://issues.apache.org/jira/browse/SPARK-20785 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI > Affects Versions: 2.3.0 > Reporter: guoxiaolongzte > Priority: Minor > > It provides links that jump to Running Queries, Completed Queries and Failed > Queries. > It adds a (count) to Running Queries, Completed Queries and Failed Queries.
[jira] [Resolved] (SPARK-20806) Launcher: redundant code, invalid branch of judgment
[ https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20806. --- Resolution: Invalid It's hard to understand what you're describing, but I looked at the code you describe and the logic is fine. The argument failIfNotFound is not always true/false. > Launcher: redundant code, invalid branch of judgment > -- > > Key: SPARK-20806 > URL: https://issues.apache.org/jira/browse/SPARK-20806 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Submit > Affects Versions: 2.1.1 > Reporter: Phoenix_Dad > > org.apache.spark.launcher.CommandBuilderUtils > In the findJarsDir function, there is an "if or else" branch. > The first input argument of 'checkState' in the 'if' clause is always true, > so 'checkState' is useless there.
[jira] [Created] (SPARK-20806) Launcher: redundant code, invalid branch of judgment
Phoenix_Dad created SPARK-20806: --- Summary: Launcher: redundant code, invalid branch of judgment Key: SPARK-20806 URL: https://issues.apache.org/jira/browse/SPARK-20806 Project: Spark Issue Type: Bug Components: Deploy, Spark Submit Affects Versions: 2.1.1 Reporter: Phoenix_Dad org.apache.spark.launcher.CommandBuilderUtils In the findJarsDir function, there is an "if or else" branch. The first input argument of 'checkState' in the 'if' clause is always true, so 'checkState' is useless there.
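For readers skimming the report: a Guava-style checkState raises only when its boolean argument is false, so passing a condition that the enclosing if has already established makes the call a no-op. A minimal Python sketch of the reported pattern (all names below are illustrative, not the actual launcher code):

```python
def check_state(condition: bool, message: str) -> None:
    # Mimics a Preconditions.checkState-style helper: raise if condition is false.
    if not condition:
        raise RuntimeError(message)

def find_jars_dir(jars_dir_exists: bool, fail_if_not_found: bool) -> str:
    # Sketch of the pattern the report describes: inside the branch taken only
    # when the directory exists, passing `jars_dir_exists` (always True here)
    # to check_state makes the check a no-op.
    if jars_dir_exists:
        check_state(jars_dir_exists, "jars dir not found")  # can never raise here
        return "assembly/target/scala-2.11/jars"
    elif fail_if_not_found:
        raise RuntimeError("jars dir not found")
    return ""
```

Whether the real code is actually redundant depends on which expression is passed to checkState, which is what the resolution comment disputes.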
[jira] [Commented] (SPARK-20805) updated updateP in SVD++ is incorrect
[ https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016969#comment-16016969 ] Sean Owen commented on SPARK-20805: --- This is hard to understand [~BoLing]. Can you open a PR with the proposed change, and explain the effect of the problem more clearly? > updated updateP in SVD++ is incorrect > -- > > Key: SPARK-20805 > URL: https://issues.apache.org/jira/browse/SPARK-20805 > Project: Spark > Issue Type: Bug > Components: GraphX > Affects Versions: 1.6.1, 2.1.1 > Reporter: BoLing > > In the SVD++ algorithm, we all know that usr._2 stores the value of pu + > |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values of > updateP, updateQ and updateY. During the iteration cycle, the y part of usr._2 is > updated, but pu is never updated, so we should fix the sendToSrc message in > sendMsgTrainF. The old code is > ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * > conf.gamma1)). If we change it to ctx.sendToSrc((updateP, updateP, (err - > conf.gamma6 * usr._3) * conf.gamma1)), it may achieve the effect we want.
[jira] [Created] (SPARK-20805) updated updateP in SVD++ is incorrect
BoLing created SPARK-20805: -- Summary: updated updateP in SVD++ is incorrect Key: SPARK-20805 URL: https://issues.apache.org/jira/browse/SPARK-20805 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.1.1, 1.6.1 Reporter: BoLing In the SVD++ algorithm, we all know that usr._2 stores the value of pu + |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values of updateP, updateQ and updateY. During the iteration cycle, the y part of usr._2 is updated, but pu is never updated, so we should fix the sendToSrc message in sendMsgTrainF. The old code is ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1)). If we change it to ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * usr._3) * conf.gamma1)), it may achieve the effect we want.
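As background for the report above, the SVD++ user factor referenced in the description is assembled as u = pu + |N(u)|^(-0.5) * sum(y_j). A small pure-Python sketch of that formula (illustrative only, not the GraphX implementation):

```python
import math

def svdpp_user_vector(pu, neighbor_ys):
    # u = p_u + |N(u)|^(-0.5) * sum(y_j), computed componentwise.
    # pu: the user's latent factor vector; neighbor_ys: one y vector per rated item.
    scale = 1.0 / math.sqrt(len(neighbor_ys))
    y_sum = [sum(components) for components in zip(*neighbor_ys)]
    return [p + scale * s for p, s in zip(pu, y_sum)]
```

The report's claim is that during training only the sum(y) part of this combined vector is refreshed while pu stays stale; the sketch just makes explicit which two quantities are being conflated in usr._2.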
[jira] [Updated] (SPARK-20803) KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not
[ https://issues.apache.org/jira/browse/SPARK-20803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bettadapura Srinath Sharma updated SPARK-20803: --- Description: When data is NOT normally distributed (correct behavior): This code: vecRDD = sc.parallelize(colVec) kd = KernelDensity() kd.setSample(vecRDD) kd.setBandwidth(3.0) # Find density estimates for the given values densities = kd.estimate(samplePoints) produces: 17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at KernelDensity.scala:92 17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at KernelDensity.scala:92) with 1 output partitions 17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate at KernelDensity.scala:92) 17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List() 17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List() 17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 6.6 KB, free 413.6 MB) 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.6 MB) 17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB) 17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:996 17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345) 17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks 17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 24, localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes) 17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24) 17/05/18 15:40:37 INFO 
PythonRunner: Times: total = 66, boot = -1831, init = 1844, finish = 53 17/05/18 15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). 2476 bytes result sent to driver 17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at KernelDensity.scala:92) finished in 1.001 s 17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 24) in 1004 ms on localhost (executor driver) (1/1) 17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks have all completed, from pool 17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at KernelDensity.scala:92, took 1.136263 s 17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB) followed by the estimated densities: 5.6654703477e-05, then 0.000100010001 repeated for the remaining sample points. But if data IS normally distributed, I see: 17/05/18 15:50:16 ERROR Executor: Exception in task 0.0 in stage 24.0 (TID 24) net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) In Scala, the correct result is: Code: vecRDD = sc.parallelize(colVec) kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0) // Find density estimates for the given values densities = kd.estimate(samplePoints) [0.04113814235801906,1.0994865517293571E-163, followed by 0.0 for the remaining points]
[jira] [Created] (SPARK-20804) Join with null safe equality fails with AnalysisException
koert kuipers created SPARK-20804: - Summary: Join with null safe equality fails with AnalysisException Key: SPARK-20804 URL: https://issues.apache.org/jira/browse/SPARK-20804 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Environment: org.apache.spark#spark-sql_2.11;2.3.0-SNAPSHOT from asf snapshots, Mon May 15 08:09:18 EDT 2017 Reporter: koert kuipers Priority: Minor {noformat} val x = Seq(("a", 1), ("a", 2), (null, 1)).toDF("k", "v") val sums = x.groupBy($"k").agg(sum($"v") as "sum") x .join(sums, x("k") <=> sums("k")) .drop(sums("k")) .show {noformat} gives: {noformat} org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans Project [_2#54 AS v#57] +- LocalRelation [_1#53, _2#54] and Aggregate [k#69], [k#69, sum(cast(v#70 as bigint)) AS sum#65L] +- Project [_1#53 AS k#69, _2#54 AS v#70] +- LocalRelation [_1#53, _2#54] Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.; at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1081) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1078) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1078) at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1063) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:79) at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:79) at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:81) at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90) at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2901) at org.apache.spark.sql.Dataset.head(Dataset.scala:2238) at org.apache.spark.sql.Dataset.take(Dataset.scala:2451) at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) at org.apache.spark.sql.Dataset.show(Dataset.scala:680) at org.apache.spark.sql.Dataset.show(Dataset.scala:639) at org.apache.spark.sql.Dataset.show(Dataset.scala:648) {noformat}
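For context on the join condition above: Spark's null-safe equality operator `<=>` (`eqNullSafe`) differs from ordinary `=` in that two NULLs compare equal and the result is never NULL, so a `<=>` join condition is not trivial even when one side is null. A pure-Python sketch of those semantics, with None standing in for SQL NULL:

```python
def null_safe_eq(a, b):
    # SQL's <=> (eqNullSafe): NULL <=> NULL is true, NULL <=> x is false for
    # non-null x, and non-null operands compare normally. Unlike '=', the
    # result is always a boolean, never NULL.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```

The bug report is that the optimizer's CheckCartesianProducts rule fails to recognize this as a real (non-trivial) join condition.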
[jira] [Assigned] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.
[ https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20801: Assignee: Apache Spark > Store accurate size of blocks in MapStatus when it's above threshold. > - > > Key: SPARK-20801 > URL: https://issues.apache.org/jira/browse/SPARK-20801 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: jin xing > Assignee: Apache Spark > > Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is > used to store block sizes. In HighlyCompressedMapStatus, only the average size > is stored for non-empty blocks, which is not good for memory control when we > shuffle blocks. It makes sense to store the accurate size of a block when it's > above a threshold.
[jira] [Updated] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.
[ https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jin xing updated SPARK-20801: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-19659 > Store accurate size of blocks in MapStatus when it's above threshold. > - > > Key: SPARK-20801 > URL: https://issues.apache.org/jira/browse/SPARK-20801 > Project: Spark > Issue Type: Sub-task > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: jin xing > > Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is > used to store block sizes. In HighlyCompressedMapStatus, only the average size > is stored for non-empty blocks, which is not good for memory control when we > shuffle blocks. It makes sense to store the accurate size of a block when it's > above a threshold.
[jira] [Assigned] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.
[ https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20801: Assignee: (was: Apache Spark) > Store accurate size of blocks in MapStatus when it's above threshold. > - > > Key: SPARK-20801 > URL: https://issues.apache.org/jira/browse/SPARK-20801 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: jin xing > > Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is > used to store block sizes. In HighlyCompressedMapStatus, only the average size > is stored for non-empty blocks, which is not good for memory control when we > shuffle blocks. It makes sense to store the accurate size of a block when it's > above a threshold.
[jira] [Commented] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.
[ https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016813#comment-16016813 ] Apache Spark commented on SPARK-20801: -- User 'jinxing64' has created a pull request for this issue: https://github.com/apache/spark/pull/18031 > Store accurate size of blocks in MapStatus when it's above threshold. > - > > Key: SPARK-20801 > URL: https://issues.apache.org/jira/browse/SPARK-20801 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.1.1 > Reporter: jin xing > > Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is > used to store block sizes. In HighlyCompressedMapStatus, only the average size > is stored for non-empty blocks, which is not good for memory control when we > shuffle blocks. It makes sense to store the accurate size of a block when it's > above a threshold.
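The idea in the description - keep exact sizes only for blocks above a threshold, and an average for the rest - can be illustrated as follows. This is a hedged Python sketch; `compress_map_status`, the dict layout, and the threshold handling are illustrative, not Spark's actual MapStatus API:

```python
def compress_map_status(block_sizes, threshold):
    # Keep exact sizes only for blocks larger than `threshold`; summarize all
    # remaining non-empty blocks by their average size (the lossy part that
    # HighlyCompressedMapStatus-style compression accepts).
    huge = {i: s for i, s in enumerate(block_sizes) if s > threshold}
    rest = [s for i, s in enumerate(block_sizes) if i not in huge and s > 0]
    avg = sum(rest) / len(rest) if rest else 0
    return huge, avg

def estimated_size(block_id, huge, avg):
    # Exact answer for outlier blocks, average for everything else.
    return huge.get(block_id, avg)
```

The point of the proposal is visible in the sketch: without the `huge` map, a single enormous block would be reported as merely average-sized, defeating memory control on the reduce side.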
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016664#comment-16016664 ] Ruslan Dautkhanov commented on SPARK-18838: --- My 2 cents: it would be nice to explore the idea of storing offsets of consumed messages in the listeners themselves, very much like Kafka consumers do (based on my limited knowledge of Spark event queue listeners, assuming the listeners don't depend on each other and can read from the queue asynchronously) - then if one of the "non-critical" listeners can't keep up, messages would be lost only for that one listener, and it wouldn't affect the rest of the listeners. {quote}Alternatively, we could use two queues, one for internal listeners and another for external ones{quote} Making a parallel with Kafka again, it looks like we're talking about two "topics" here. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement > Affects Versions: 2.0.0 > Reporter: Sital Kedia > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the listeners for each event and processes each event > synchronously: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs.
> Also, if one of the listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service guarantees in-order processing of events per listener. The queue > used by the executor service will be bounded to guarantee we do not grow memory > indefinitely. The downside of this approach is that a separate event queue per > listener will increase the driver memory footprint.
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016663#comment-16016663 ] Josh Rosen commented on SPARK-18838: [~bOOmX], do you have CPU-time profiling within each of those listeners? I'm wondering why StorageListener is so slow. Although it's not a real solution, I bet that we could make a significant constant-factor improvement to StorageListener (perhaps by using more imperative Java-style code instead of Scala collections). > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement > Affects Versions: 2.0.0 > Reporter: Sital Kedia > > Currently we are observing very high event processing delay in the > driver's `ListenerBus` for large jobs with many tasks. Many critical > components of the scheduler, like `ExecutorAllocationManager` and > `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay might > hurt job performance significantly or even fail the job. For example, a > significant delay in receiving `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the listeners for each event and processes each event > synchronously: > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single-threaded processor often becomes the bottleneck for large jobs. > Also, if one of the listeners is very slow, all the listeners will pay the > price of the delay incurred by the slow listener. In addition, a slow > listener can cause events to be dropped from the event queue, which might be > fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the > single-threaded event processor. Instead, each listener will have its own > dedicated single-threaded executor service. Whenever an event is posted, it > will be submitted to the executor service of every listener. The single-threaded > executor service guarantees in-order processing of events per listener. The queue > used by the executor service will be bounded to guarantee we do not grow memory > indefinitely. The downside of this approach is that a separate event queue per > listener will increase the driver memory footprint.
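The per-listener design proposed in the description (one bounded queue and one dedicated thread per listener, so a slow listener delays and drops only its own events) can be sketched as follows. This is an illustrative Python model, not the Scala implementation:

```python
import queue
import threading

class ListenerWorker:
    # One bounded queue plus one dedicated thread per listener: a slow or
    # stuck listener fills (and drops from) only its own queue.
    def __init__(self, listener, capacity=1000):
        self.listener = listener
        self.q = queue.Queue(maxsize=capacity)
        self.dropped = 0
        threading.Thread(target=self._run, daemon=True).start()

    def post(self, event):
        try:
            self.q.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # the drop affects this listener only

    def _run(self):
        while True:
            event = self.q.get()
            self.listener(event)  # in-order delivery per listener

class ListenerBus:
    # Posting fans the event out to every listener's private queue.
    def __init__(self):
        self.workers = []

    def add_listener(self, listener, capacity=1000):
        self.workers.append(ListenerWorker(listener, capacity))

    def post(self, event):
        for w in self.workers:
            w.post(event)
```

The memory-footprint downside noted in the description is also visible here: each listener now holds up to `capacity` buffered events of its own.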
[jira] [Created] (SPARK-20803) KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not
Bettadapura Srinath Sharma created SPARK-20803: -- Summary: KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not normally distributed) Key: SPARK-20803 URL: https://issues.apache.org/jira/browse/SPARK-20803 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 2.1.1 Environment: Linux version 4.4.14-smp x86/fpu: Legacy x87 FPU detected. using command line: bash-4.3$ ./bin/spark-submit ~/work/python/Features.py bash-4.3$ pwd /home/bsrsharma/spark-2.1.1-bin-hadoop2.7 export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121 Reporter: Bettadapura Srinath Sharma When data is NOT normally distributed (correct behavior): This code: vecRDD = sc.parallelize(colVec) kd = KernelDensity() kd.setSample(vecRDD) kd.setBandwidth(3.0) # Find density estimates for the given values densities = kd.estimate(samplePoints) produces: 17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at KernelDensity.scala:92 17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at KernelDensity.scala:92) with 1 output partitions 17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate at KernelDensity.scala:92) 17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List() 17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List() 17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which has no missing parents 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 6.6 KB, free 413.6 MB) 17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 3.6 KB, free 413.6 MB) 17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB) 17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at 
DAGScheduler.scala:996 17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345) 17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks 17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 24, localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes) 17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24) 17/05/18 15:40:37 INFO PythonRunner: Times: total = 66, boot = -1831, init = 1844, finish = 53 17/05/18 15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). 2476 bytes result sent to driver 17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at KernelDensity.scala:92) finished in 1.001 s 17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 24) in 1004 ms on localhost (executor driver) (1/1) 17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks have all completed, from pool 17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at KernelDensity.scala:92, took 1.136263 s 17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB) followed by the estimated densities: 5.6654703477e-05, then 0.000100010001 repeated for the remaining sample points. But if data IS normally distributed, the estimate fails with: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
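For reference, MLlib's KernelDensity computes a Gaussian kernel density estimate, f(x) = 1/(n*h) * sum_i phi((x - x_i)/h) with bandwidth h. A self-contained Python sketch of that formula (illustrative, independent of Spark) that can be used to sanity-check the density values printed in the logs above:

```python
import math

def gaussian_kde(sample, bandwidth, points):
    # f(x) = 1/(n*h) * sum over sample of phi((x - x_i)/h),
    # where phi is the standard normal pdf.
    n, h = len(sample), bandwidth

    def phi(z):
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    return [sum(phi((x - xi) / h) for xi in sample) / (n * h) for x in points]
```

With a single sample point and bandwidth 1.0, the density at that point is simply phi(0) ≈ 0.3989; for the bandwidth-3.0 case in the report, densities fall off smoothly away from the sample.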
[jira] [Commented] (SPARK-20784) Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() in YARN client mode
[ https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016600#comment-16016600 ] Hyukjin Kwon commented on SPARK-20784: -- Thank you for checking this out. > Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() > in YARN client mode > - > > Key: SPARK-20784 > URL: https://issues.apache.org/jira/browse/SPARK-20784 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.0.2, 2.1.1 > Reporter: Mathieu D > > Spark hangs and stops executing any job or task (v2.0.2). > The Web UI shows *0 active stages* and *0 active tasks* on executors, although a > driver thread is clearly working/finishing a stage (see below). > Our application runs several Spark contexts for several users in parallel in > threads. Spark version 2.0.2, yarn-client. > Extract of thread stack below. > {noformat} > "ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x7fddf0005800 > nid=0x484 waiting on condition [0x7fddd0bf > 6000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00078c232760> (a > scala.concurrent.impl.Promise$CompletionLatch) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at > org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212) > at > 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) 
> at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scal
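As a side note to the report above: when a broadcast exchange blocks like this, two Spark SQL settings are commonly tuned as a mitigation. This is a hedged sketch, not a fix confirmed in this thread, and the values are illustrative only:

```python
def spark_submit_conf_flags(conf):
    """Render a conf dict as spark-submit --conf arguments."""
    return ["--conf %s=%s" % (k, v) for k, v in sorted(conf.items())]

# Illustrative values, not recommendations.
broadcast_workaround_conf = {
    # Give the broadcast future longer to complete (seconds).
    "spark.sql.broadcastTimeout": "1200",
    # Or disable automatic broadcast joins so the planner falls back to a
    # sort-merge join; -1 turns the size threshold off.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}

flags = spark_submit_conf_flags(broadcast_workaround_conf)
```

Disabling the automatic broadcast sidesteps `BroadcastExchangeExec` entirely, which is often the quickest way to confirm the hang is broadcast-related.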
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016579#comment-16016579 ] Antoine PRANG commented on SPARK-18838: --- [~joshrosen][~sitalke...@gmail.com] I have measured the listener execution time on my test case. You can find below, for each listener, the average percentage of the total execution time for a message (I have instrumented ListenerBus): EnvironmentListener: 0.0, EventLoggingListener: 48.2, ExecutorsListener: 0.1, HeartbeatReceiver: 0.0, JobProgressListener: 6.9, PepperdataSparkListener: 2.7, RDDOperationGraphListener: 0.1, StorageListener: 38.3, StorageStatusListener: 0.4. The execution time is concentrated in two listeners: EventLoggingListener and StorageListener. I think that putting parallelization at the listener-bus level is not such a good idea. Duplicating the messages into two queues would change the current synchronization contract (all listeners receive each message at the same time; any listener is at most one message ahead of or behind any other). For me, the best approach would be to keep the listener bus as simple as it is now (N producers, 1 consumer), take advantage of that to dequeue as fast as possible, and introduce parallelization at the listener level - staying aware of the synchronization contract - where possible. The EventLoggingListener, for example, can be executed asynchronously. I am preparing a commit to do that right now. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. 
Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
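A minimal sketch of the asynchronous-listener idea in the comment above - keep the bus itself a single consumer, but give an individually slow listener (such as the event logger) its own worker thread - might look like this. The names are illustrative Python, not Spark's actual classes:

```python
import queue
import threading

class AsyncListener:
    """Wrap a slow listener so its work runs on a dedicated single thread.

    The bus thread only enqueues; per-listener ordering is preserved because
    exactly one thread drains the queue. The bounded queue keeps memory in
    check at the cost of back-pressure when the listener falls behind.
    """

    def __init__(self, delegate, capacity=1000):
        self._delegate = delegate
        self._events = queue.Queue(maxsize=capacity)  # bounded queue
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def on_event(self, event):
        # Called from the bus thread; returns almost immediately.
        self._events.put(event)

    def _drain(self):
        while True:
            event = self._events.get()
            if event is None:  # shutdown sentinel
                break
            self._delegate(event)

    def stop(self):
        self._events.put(None)
        self._worker.join()
```

Because only the wrapped listener moves off the bus thread, the other listeners keep the existing "at most one message apart" synchronization contract among themselves.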
[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x
[ https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016309#comment-16016309 ] William Handy commented on SPARK-19076: --- It seems like it was decided that this was too difficult, but I wanted to point out that hive 2.1 has multithreaded writes with settings hive.mv.files.thread and hive.metastore.fshandler.threads. If you happen to be using spark on S3, these settings would be a significant performance boost. There are several articles talking about using these settings in the context of "Hive on Spark", when I want to see them in "Hive _in_ Spark" instead :-/ > Upgrade Hive dependence to Hive 2.x > --- > > Key: SPARK-19076 > URL: https://issues.apache.org/jira/browse/SPARK-19076 > Project: Spark > Issue Type: Improvement >Reporter: Dapeng Sun > > Currently the upstream Spark depends on Hive 1.2.1 to build package, and Hive > 2.0 has been released in February 2016, Hive 2.0.1 and 2.1.0 also released > for a long time, at Spark side, it is better to support Hive 2.0 and above. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
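The two Hive 2.1 settings named in the comment above can be collected into a small conf fragment. The thread counts below are placeholders, not recommendations:

```python
# Parallel file-move settings from Hive 2.1, as mentioned in the comment.
# Thread counts are illustrative; tune for your filesystem. They matter
# most on object stores like S3, where "moves" are copy-then-delete.
hive_parallel_move_conf = {
    "hive.mv.files.thread": "16",
    "hive.metastore.fshandler.threads": "16",
}

def as_set_statements(conf):
    """Render the conf as Hive CLI SET statements."""
    return ["SET %s=%s;" % (k, v) for k, v in sorted(conf.items())]
```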
[jira] [Commented] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException
[ https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016305#comment-16016305 ] Louis Bergelson commented on SPARK-20389: - What's the process for evaluating the effect on spark apps? As far as I can tell the changes from 3->4 will mean that data serialized with an older version of kryo will not be loadable by a new version of kryo unless you run with a compatibility option configured. Hopefully most apps aren't storing data that's been serialized by kryo and only using it for serialization between processes. I don't know if this has knock on effects for things like parquet though, does it use kryo? We have a recurrent issue when serializing large objects that is fixed in kryo 4 and would really like to see spark updated. > Upgrade kryo to fix NegativeArraySizeException > -- > > Key: SPARK-20389 > URL: https://issues.apache.org/jira/browse/SPARK-20389 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 2.1.0 > Environment: Linux, Centos7, jdk8 >Reporter: Georg Heiler > > I am experiencing an issue with Kryo when writing parquet files. Similar to > https://github.com/broadinstitute/gatk/issues/1524 a > NegativeArraySizeException occurs. Apparently this is fixed in a current Kryo > version. Spark is still using the very old 3.3 Kryo. > Can you please upgrade to a fixed Kryo version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Affects Version/s: (was: 2.1.0) 2.1.1 > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: kobefeng > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > spark-sql> alter table 
kofeng.partitioned_table drop if exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive insert overwrite that specific location successfully. > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 > -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 > /user/kofeng/partitioned_table/20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
[ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kobefeng updated SPARK-20663: - Labels: (was: easyfix) > Data missing after insert overwrite table partition which is created on > specific location > - > > Key: SPARK-20663 > URL: https://issues.apache.org/jira/browse/SPARK-20663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: kobefeng > > Use spark sql to create partition table first, and alter table by adding > partition on specific location, then insert overwrite into this partition by > selection, which will cause data missing compared with HIVE. > {code:title=partition_table_insert_overwrite.sql|borderStyle=solid} > -- create partition table first > $ hadoop fs -mkdir /user/kofeng/partitioned_table > $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql > spark-sql> create table kofeng.partitioned_table( > > id bigint, > > name string, > > dt string > > ) using parquet options ('compression'='snappy', > 'path'='/user/kofeng/partitioned_table') > > partitioned by (dt); > -- add partition with specific location > spark-sql> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > $ hadoop fs -ls /user/kofeng/partitioned_table > drwxr-xr-x - kofeng kofeng 0 2017-05-08 17:00 > /user/kofeng/partitioned_table/20170507 > -- insert overwrite this partition, and the specific location folder gone, > data is missing, job is success by attaching _SUCCESS > spark-sql> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name; > $ hadoop fs -ls /user/kofeng/partitioned_table > -rw-r--r-- 3 kofeng kofeng 0 2017-05-08 17:06 > /user/kofeng/partitioned_table/_SUCCESS > > > -- Then drop this partition and use hive to add partition and insert > overwrite this partition data, then verify: > spark-sql> alter table kofeng.partitioned_table drop if 
exists > partition(dt='20170507'); > hive> alter table kofeng.partitioned_table add if not exists > partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507'; > OK > -- could see hive insert overwrite that specific location successfully. > hive> insert overwrite table kofeng.partitioned_table > partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test; > hive> select * from kofeng.partitioned_table; > OK > 123 kofeng 20170507 > $ hadoop fs -ls /user/kofeng/partitioned_table/20170507 > -rwxr-xr-x 3 kofeng kofeng 338 2017-05-08 17:26 > /user/kofeng/partitioned_table/20170507/00_0 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16627) --jars doesn't work in Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016253#comment-16016253 ] Michael Gummelt commented on SPARK-16627: - I'm not completely sure, but I believe that the dispatcher is correctly setting {{spark.jars}}, but due to SPARK-10643, the driver is not recognizing the remote jar URL. > --jars doesn't work in Mesos mode > - > > Key: SPARK-16627 > URL: https://issues.apache.org/jira/browse/SPARK-16627 > Project: Spark > Issue Type: Bug > Components: Mesos >Reporter: Michael Gummelt > > Definitely doesn't work in cluster mode. Might not work in client mode > either. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20802) kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not nor
Bettadapura Srinath Sharma created SPARK-20802: -- Summary: kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not normally distributed) Key: SPARK-20802 URL: https://issues.apache.org/jira/browse/SPARK-20802 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 2.1.1 Environment: Linux version 4.4.14-smp x86/fpu: Legacy x87 FPU detected. using command line: bash-4.3$ ./bin/spark-submit ~/work/python/Features.py bash-4.3$ pwd /home/bsrsharma/spark-2.1.1-bin-hadoop2.7 export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121 Reporter: Bettadapura Srinath Sharma In Scala (correct behavior), the code: testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", means(j), stdDev(j)) produces: 17/05/18 10:52:53 INFO FeatureLogger: Kolmogorov-Smirnov test summary: degrees of freedom = 0 statistic = 0.005495681749849268 pValue = 0.9216108887428276 No presumption against null hypothesis: Sample follows theoretical distribution. In Python (incorrect behavior), the code: testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], numericSD[j]) causes this error: 17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14) net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
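The PickleException above points at numpy types crossing the Py4J/pickle boundary. A plausible, unverified workaround is to coerce the distribution parameters to plain Python floats before calling kolmogorovSmirnovTest; the helper below is hypothetical, not part of PySpark:

```python
def to_py_float(x):
    """Coerce a numpy scalar (or anything float-like) to a plain float.

    numpy scalars expose .item(), which returns the native Python value;
    everything else is handed to float() directly.
    """
    return float(x.item()) if hasattr(x, "item") else float(x)

# Usage sketch with the report's (hypothetical) variable names:
#   Statistics.kolmogorovSmirnovTest(
#       vecRDD, 'norm',
#       to_py_float(numericMean[j]), to_py_float(numericSD[j]))
```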
[jira] [Commented] (SPARK-12559) Cluster mode doesn't work with --packages
[ https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016243#comment-16016243 ] Michael Gummelt commented on SPARK-12559: - I changed the title from "Standalone cluster mode" to "cluster mode", since --packages doesn't work with any cluster mode. > Cluster mode doesn't work with --packages > - > > Key: SPARK-12559 > URL: https://issues.apache.org/jira/browse/SPARK-12559 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Andrew Or > > From the mailing list: > {quote} > Another problem I ran into that you also might is that --packages doesn't > work with --deploy-mode cluster. It downloads the packages to a temporary > location on the node running spark-submit, then passes those paths to the > node that is running the Driver, but since that isn't the same machine, it > can't find anything and fails. The driver process *should* be the one > doing the downloading, but it isn't. I ended up having to create a fat JAR > with all of the dependencies to get around that one. > {quote} > The problem is that we currently don't upload jars to the cluster. It seems > to fix this we either (1) do upload jars, or (2) just run the packages code > on the driver side. I slightly prefer (2) because it's simpler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12559) Cluster mode doesn't work with --packages
[ https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-12559: Summary: Cluster mode doesn't work with --packages (was: Standalone cluster mode doesn't work with --packages) > Cluster mode doesn't work with --packages > - > > Key: SPARK-12559 > URL: https://issues.apache.org/jira/browse/SPARK-12559 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Andrew Or > > From the mailing list: > {quote} > Another problem I ran into that you also might is that --packages doesn't > work with --deploy-mode cluster. It downloads the packages to a temporary > location on the node running spark-submit, then passes those paths to the > node that is running the Driver, but since that isn't the same machine, it > can't find anything and fails. The driver process *should* be the one > doing the downloading, but it isn't. I ended up having to create a fat JAR > with all of the dependencies to get around that one. > {quote} > The problem is that we currently don't upload jars to the cluster. It seems > to fix this we either (1) do upload jars, or (2) just run the packages code > on the driver side. I slightly prefer (2) because it's simpler. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
[ https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016217#comment-16016217 ] Sital Kedia commented on SPARK-18838: - [~joshrosen] - >> Alternatively, we could use two queues, one for internal listeners and another for external ones. This wouldn't be as fine-grained as thread-per-listener but might buy us a lot of the benefits with perhaps less code needed. Actually that is exactly what my PR is doing. https://github.com/apache/spark/pull/16291. I have not been able to work on it recently, but you can take a look and let me know how it looks. I can prioritize working on it. > High latency of event processing for large jobs > --- > > Key: SPARK-18838 > URL: https://issues.apache.org/jira/browse/SPARK-18838 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Currently we are observing the issue of very high event processing delay in > driver's `ListenerBus` for large jobs with many tasks. Many critical > component of the scheduler like `ExecutorAllocationManager`, > `HeartbeatReceiver` depend on the `ListenerBus` events and this delay might > hurt the job performance significantly or even fail the job. For example, a > significant delay in receiving the `SparkListenerTaskStart` might cause > `ExecutorAllocationManager` manager to mistakenly remove an executor which is > not idle. > The problem is that the event processor in `ListenerBus` is a single thread > which loops through all the Listeners for each event and processes each event > synchronously > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94. > This single threaded processor often becomes the bottleneck for large jobs. > Also, if one of the Listener is very slow, all the listeners will pay the > price of delay incurred by the slow listener. 
In addition to that a slow > listener can cause events to be dropped from the event queue which might be > fatal to the job. > To solve the above problems, we propose to get rid of the event queue and the > single threaded event processor. Instead each listener will have its own > dedicate single threaded executor service . When ever an event is posted, it > will be submitted to executor service of all the listeners. The Single > threaded executor service will guarantee in order processing of the events > per listener. The queue used for the executor service will be bounded to > guarantee we do not grow the memory indefinitely. The downside of this > approach is separate event queue per listener will increase the driver memory > footprint. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
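The proposal in the description - a dedicated single-threaded executor per listener, each with a bounded queue - can be sketched as follows. This is illustrative Python, not the real LiveListenerBus:

```python
import queue
import threading

class PerListenerBus:
    """One bounded queue per listener, drained by one dedicated thread.

    A slow listener now delays only itself, while single-threaded draining
    still guarantees in-order delivery per listener. Bounded queues cap the
    extra driver memory this design costs.
    """

    def __init__(self, listeners, capacity=10000):
        self._queues = []
        self._threads = []
        for listener in listeners:
            q = queue.Queue(maxsize=capacity)  # bounded per-listener queue
            t = threading.Thread(target=self._drain, args=(q, listener), daemon=True)
            t.start()
            self._queues.append(q)
            self._threads.append(t)

    def post(self, event):
        # Fan the event out to every listener's queue.
        for q in self._queues:
            q.put(event)  # blocks (or could drop) when a queue is full

    @staticmethod
    def _drain(q, listener):
        while True:
            event = q.get()
            if event is None:  # shutdown sentinel
                return
            listener(event)

    def stop(self):
        for q in self._queues:
            q.put(None)
        for t in self._threads:
            t.join()
```

Note the trade-off the description calls out: N listeners means N queues of buffered events, so the driver's memory footprint grows with the number of listeners.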
[jira] [Commented] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization
[ https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016168#comment-16016168 ] Ruslan Dautkhanov commented on SPARK-20776: --- Thank you [~joshrosen]. Would it be possible to backport this patch to Spark 2.1 as well? > Fix JobProgressListener perf. problems caused by empty TaskMetrics > initialization > - > > Key: SPARK-20776 > URL: https://issues.apache.org/jira/browse/SPARK-20776 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.2.0 > > > In > {code} > ./bin/spark-shell --master=local[64] > {code} > I ran > {code} > sc.parallelize(1 to 10, 10).count() > {code} > and profiled the time spend in the LiveListenerBus event processing thread. I > discovered that the majority of the time was being spent constructing empty > TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we > can slightly simplify the code to remove the need to construct one empty > TaskMetrics per onTaskSubmitted event. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
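The optimization described above - not constructing a fresh empty TaskMetrics on every onTaskSubmitted event - is, in miniature, the shared-empty-sentinel pattern. The sketch below is illustrative Python, not the actual JobProgressListener code:

```python
from types import MappingProxyType

# One shared, read-only stand-in for "empty TaskMetrics". Every task start
# can reference this same object instead of allocating a new empty one.
EMPTY_METRICS = MappingProxyType({})

class TaskTracker:
    def __init__(self):
        self.metrics_by_task = {}

    def on_task_start(self, task_id):
        # Cheap: no per-event allocation, just a reference to the sentinel.
        self.metrics_by_task[task_id] = EMPTY_METRICS

    def on_task_end(self, task_id, metrics):
        # Allocate only when real values actually arrive.
        self.metrics_by_task[task_id] = dict(metrics)
```

The read-only proxy makes the shared sentinel safe: no event handler can accidentally mutate the "empty" state that every other task references.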
[jira] [Resolved] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20364. - Resolution: Fixed Fix Version/s: 2.2.0 > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 2.2.0 > > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. 
> The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
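The first option discussed above - skip pushdown for dotted column names and filter Spark-side instead - amounts to partitioning the filter columns. A minimal illustrative helper (not Spark's actual code):

```python
def split_for_pushdown(column_names):
    """Partition filter columns into (pushable, kept_spark_side).

    Parquet's FilterApi splits names on '.' into a multi-segment
    ColumnPath, so a literal column named 'col.dots' must not be pushed
    down; its filter has to be evaluated in Spark after reading.
    """
    pushable = [c for c in column_names if "." not in c]
    kept = [c for c in column_names if "." in c]
    return pushable, kept
```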
[jira] [Assigned] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-20364: --- Assignee: Hyukjin Kwon > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Fix For: 2.2.0 > > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. 
> The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016116#comment-16016116 ] yuhao yang commented on SPARK-20768: Thanks for the ping. [~mlnick] We should just treat it as an expert param. Normally in python it should be exposed as a Param in my impression. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016061#comment-16016061 ] yuhao yang commented on SPARK-20797: [~d0evi1] Thanks for reporting the issue and proposal for the fix. Would you send a PR for the fix? > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. 
> word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
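The reporter's fix amounts to reusing Word2Vec's partition-count estimate with `numWords` replaced by the topic count `k`. A minimal self-contained sketch of that sizing logic (names hypothetical, the buffer size would come from `spark.kryoserializer.buffer.max` in practice):

```scala
// Sketch of the proposed fix: size the saved output by the topics matrix
// instead of hard-coding repartition(1), mirroring Word2Vec.save's estimate
// with numWords swapped for the topic count k.
object LdaSaveSizing {
  def nPartitions(vocabSize: Long, k: Long, bufferSizeBytes: Long): Int = {
    // Same shape as Word2Vec's approxSize = (4L * vectorSize + 15) * numWords
    val approxSize = (4L * vocabSize + 15L) * k
    ((approxSize / bufferSizeBytes) + 1L).toInt
  }
}
```

For the sizes in the report (vocab ~1,000,000, k = 300) this yields well over a dozen partitions with a 64 MiB buffer, whereas `repartition(1)` forces the whole matrix through a single task, which is where the serializer-buffer and frame-size limits are hit.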
[jira] [Assigned] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20796: - Assignee: liuzhaokun > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: liuzhaokun >Priority: Trivial > Fix For: 2.1.2, 2.2.0 > > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20796. --- Resolution: Fixed Fix Version/s: 2.1.2 2.2.0 Issue resolved by pull request 18027 [https://github.com/apache/spark/pull/18027] > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Trivial > Fix For: 2.2.0, 2.1.2 > > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20779. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18012 [https://github.com/apache/spark/pull/18012] > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Priority: Trivial > Fix For: 2.3.0 > > > when i test some examples, i found the license is not at the top in some > files. and it will be best if we update these places of the ASF header to be > consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20779) The ASF header placed in an incorrect location in some files
[ https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20779: - Assignee: zuotingbing > The ASF header placed in an incorrect location in some files > > > Key: SPARK-20779 > URL: https://issues.apache.org/jira/browse/SPARK-20779 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.1.0, 2.1.1 >Reporter: zuotingbing >Assignee: zuotingbing >Priority: Trivial > Fix For: 2.3.0 > > > when i test some examples, i found the license is not at the top in some > files. and it will be best if we update these places of the ASF header to be > consistent with other files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016021#comment-16016021 ] Mitesh commented on SPARK-17867: Ah I see, thanks [~viirya]. The repartitionByColumns is just a short-cut method I created. But I do have some aliasing code changes compared to 2.1, I will try to remove those and see if that is whats breaking it. > Dataset.dropDuplicates (i.e. distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015964#comment-16015964 ] Brian Albright commented on SPARK-14492: Ran into this over the past week. Here are some gists to illustrate the issue as I see it. pom.xml: https://gist.github.com/MisterSpicy/3ae3205ba09668b8e9b9fc2d1655646e the test: https://gist.github.com/MisterSpicy/20ef8cc2777dc82e305f29a507ae83d7 > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
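For reference, Spark does document a path for talking to an older metastore: the `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` settings (available since Spark 1.4), which load a matching Hive client in an isolated classloader. A minimal `spark-defaults.conf` sketch, assuming a 0.13.1 metastore:

```properties
spark.sql.hive.metastore.version  0.13.1
# "maven" downloads the matching Hive client jars at runtime (convenient but
# slow); alternatively point spark.sql.hive.metastore.jars at a classpath.
spark.sql.hive.metastore.jars     maven
```

Whether this fully sidesteps the `NoSuchFieldError` above depends on the metastore version in question, but it is the supported knob for metastore compatibility rather than swapping the built-in Hive 1.2.1 jars.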
[jira] [Created] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.
jin xing created SPARK-20801: Summary: Store accurate size of blocks in MapStatus when it's above threshold. Key: SPARK-20801 URL: https://issues.apache.org/jira/browse/SPARK-20801 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: jin xing Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the average size is stored for non-empty blocks, which is not good for memory control when we shuffle blocks. It makes sense to store the accurate size of a block when it is above a threshold. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
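The proposal can be sketched in a few lines (names hypothetical, not the actual MapStatus API): record exact sizes only for blocks above a threshold, and keep a single average for the remaining non-empty blocks, so outliers no longer get under-reported by the average.

```scala
// Sketch of the proposed summary: exact sizes for outlier blocks, one
// average for everything else (non-empty blocks at or below the threshold).
object HugeBlockSummary {
  def summarize(sizes: Array[Long], threshold: Long): (Map[Int, Long], Long) = {
    // Exact sizes, keyed by block index, for blocks above the threshold.
    val huge = sizes.zipWithIndex
      .collect { case (s, i) if s > threshold => i -> s }
      .toMap
    // Average over the remaining non-empty blocks, as today.
    val smallNonEmpty = sizes.filter(s => s > 0 && s <= threshold)
    val avg = if (smallNonEmpty.isEmpty) 0L
              else smallNonEmpty.sum / smallNonEmpty.length
    (huge, avg)
  }
}
```

A reducer planning a fetch would then use the exact size for blocks in the map and fall back to the average for the rest, keeping the status small while avoiding memory surprises from a few huge blocks.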
[jira] [Commented] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015914#comment-16015914 ] liuzhaokun commented on SPARK-20796: @Sean Owen I am so sorry about it. And I will take your advice. > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Trivial > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20784) Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() in YARN client mode
[ https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu D closed SPARK-20784. - Resolution: Not A Bug Oh boy, it was an OOM on the driver. Most of the times, it was silent. I just discovered an OOM exception in the middle of the logs in a task-result-getter. I guess the BroadcastExchange was just waiting for it > Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() > in YARN client mode > - > > Key: SPARK-20784 > URL: https://issues.apache.org/jira/browse/SPARK-20784 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2, 2.1.1 >Reporter: Mathieu D > > Spark hangs and stop executing any job or task (v2.0.2). > Web UI shows *0 active stages* and *0 active task* on executors, although a > driver thread is clearly working/finishing a stage (see below). > Our application runs several spark contexts for several users in parallel in > threads. spark version 2.0.2, yarn-client > Extract of thread stack below. 
> {noformat} > "ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x7fddf0005800 > nid=0x484 waiting on condition [0x7fddd0bf > 6000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00078c232760> (a > scala.concurrent.impl.Promise$CompletionLatch) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > at > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at > org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) > at > org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) > at > 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197) > at > org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapt
[jira] [Commented] (SPARK-20782) Dataset's isCached operator
[ https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015872#comment-16015872 ] Wenchen Fan commented on SPARK-20782: - one alternative is to create a temp view and cache the view, then we can use `spark.catalog.isCache(viewName)` > Dataset's isCached operator > --- > > Key: SPARK-20782 > URL: https://issues.apache.org/jira/browse/SPARK-20782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > It'd be very convenient to have {{isCached}} operator that would say whether > a query is cached in-memory or not. > It'd be as simple as the following snippet: > {code} > // val q2: DataFrame > spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015852#comment-16015852 ] Liang-Chi Hsieh commented on SPARK-17867: - The above example code can't compile with current codebase. There is no repartitionByColumns but only repartition. {code} val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartition($"userid") .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartition($"userid") .sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") {code} The optimized plan looks like: {code} Sort [userid#9 ASC NULLS FIRST], false +- RepartitionByExpression [userid#9], 5 +- Filter (isnotnull(del#12) && NOT (del#12 = hi)) +- Aggregate [eventid#10], [first(userid#9, false) AS userid#9, eventid#10, first(vk#11, false) AS vk#11, first(del#12, false) AS del#12] +- Sort [userid#9 ASC NULLS FIRST, eventid#10 ASC NULLS FIRST, vk#11 DESC NULLS LAST], false +- RepartitionByExpression [userid#9], 5 +- LocalRelation [userid#9, eventid#10, vk#11, del#12] {code} The spark plan looks like: {code} Sort [userid#9 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(userid#9, 5) +- Filter (isnotnull(del#12) && NOT (del#12 = hi)) +- SortAggregate(key=[eventid#10], functions=[first(userid#9, false), first(vk#11, false), first(del#12, false)], output=[userid#9, eventid#10, vk#11, del#12]) +- SortAggregate(key=[eventid#10], functions=[partial_first(userid#9, false), partial_first(vk#11, false), partial_first(del#12, false)], output=[eventid#10, first#35, valueSet#36, first#37, valueSet#38, first#39, valueSet#40]) +- Sort [userid#9 ASC NULLS FIRST, eventid#10 ASC NULLS FIRST, vk#11 DESC NULLS LAST], false, 0 +- Exchange hashpartitioning(userid#9, 5) +- LocalTableScan [userid#9, eventid#10, vk#11, del#12] {code} Looks 
like the "del <> 'hi'" filter doesn't get pushed down? > Dataset.dropDuplicates (i.e. distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
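The semantics under discussion — `dropDuplicates("eventid")` keeping the first row per key via `first(...)` aggregates — can be sketched in plain Scala (hypothetical helper, not the Dataset API):

```scala
// Plain-Scala analogue of dropDuplicates("eventid") as it appears in the
// plans above: group by the key column and keep the first row seen per key
// (order-dependent, like the first(..., false) aggregates).
object DropDuplicatesDemo {
  final case class Row(userid: Int, eventid: Int, vk: Int, del: String)

  def dropDuplicatesByEventId(rows: Seq[Row]): Seq[Row] =
    rows.groupBy(_.eventid)       // one group per eventid
      .values.map(_.head)         // keep the first row in each group
      .toSeq.sortBy(_.eventid)    // deterministic output order
}
```

This also makes clear why a filter on a non-key column like `del` cannot be pushed below the aggregate: which `del` value survives depends on which row `first` picks per `eventid` group.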
[jira] [Updated] (SPARK-20775) from_json should also have an API where the schema is specified with a string
[ https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-20775: - Issue Type: Improvement (was: Bug) > from_json should also have an API where the schema is specified with a string > - > > Key: SPARK-20775 > URL: https://issues.apache.org/jira/browse/SPARK-20775 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Burak Yavuz > > Right now you also have to provide a java.util.Map which is not nice for > Scala users. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20775) from_json should also have an API where the schema is specified with a string
[ https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015819#comment-16015819 ] Takeshi Yamamuro commented on SPARK-20775: -- Since I feel this is not a bug, I'll change the issue type. > from_json should also have an API where the schema is specified with a string > - > > Key: SPARK-20775 > URL: https://issues.apache.org/jira/browse/SPARK-20775 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Burak Yavuz > > Right now you also have to provide a java.util.Map which is not nice for > Scala users. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] d0evi1 updated SPARK-20797: --- Summary: mllib lda's LocalLDAModel's save: out of memory. (was: mllib lda load and save out of memory. ) > mllib lda's LocalLDAModel's save: out of memory. > - > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. 
> word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20797) mllib lda load and save out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] d0evi1 updated SPARK-20797: --- Description: when i try online lda model with large text data(nearly 1 billion chinese news' abstract), the training step went well, but the save step failed. something like below happened (etc. 1.6.1): problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the param can fix problem 1, but next will lead problem 2), problem 2. exceed spark.akka.frameSize. (turning this param too bigger will fail for the reason out of memory, kill it, version > 2.0.0, exceeds max allowed: spark.rpc.message.maxSize). when topics num is large(set topic num k=200 is ok, but set k=300 failed), and vocab size is large(nearly 1000,000) too. this problem will appear. so i found word2vec's save function is similar to the LocalLDAModel's save function : word2vec's problem (use repartition(1) to save) has been fixed [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: repartition(1). use single partition when save. 
word2vec's save method from latest code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: val approxSize = (4L * vectorSize + 15) * numWords val nPartitions = ((approxSize / bufferSize) + 1).toInt val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) but the code in mllib.clustering.LDAModel's LocalLDAModel's save: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala you'll see: val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix val topics = Range(0, k).map { topicInd => Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd) } spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) refer to word2vec's save (repartition(nPartitions)), i replace numWords to topic K, repartition(nPartitions) in the LocalLDAModel's save method, recompile the code, deploy the new lda's project with large data on our machine cluster, it works. hopes it will fixed in the next version. was: when i try online lda model with large text data, the training step went well, but the save step failed. but something like below happened (etc. 1.6.1): 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the param can fixed), 2. exceed spark.akka.frameSize. (turning this param too bigger will fail, version > 2.0.0, exceeds max allowed: spark.rpc.message.maxSize). when topics num is large, and vocab size is large too. this problem will appear. 
so i found this: https://github.com/apache/spark/pull/9989, word2vec's problem has been fixed, this is word2vec's save method from latest code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: val approxSize = (4L * vectorSize + 15) * numWords val nPartitions = ((approxSize / bufferSize) + 1).toInt val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) but the code in mllib.clustering.LDAModel's save: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala you'll see: val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix val topics = Range(0, k).map { topicInd => Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd) } spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) i try word2vec's save, replace numWords to topic K, repartition(nPartitions), recompile the code, deploy the new lda's project with large data on our machine cluster, it works. hopes it will fixed in the next version. > mllib lda load and save out of memory. > --- > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize)
[jira] [Commented] (SPARK-20797) mllib lda load and save out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015813#comment-16015813 ] d0evi1 commented on SPARK-20797: Sorry for my poor English; I have rewritten the problem. It concerns just one thing: MLlib's LocalLDAModel's save. > mllib lda load and save out of memory. > --- > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data(nearly 1 billion chinese > news' abstract), the training step went well, but the save step failed. > something like below happened (etc. 1.6.1): > problem 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the > param can fix problem 1, but next will lead problem 2), > problem 2. exceed spark.akka.frameSize. (turning this param too bigger will > fail for the reason out of memory, kill it, version > 2.0.0, exceeds max > allowed: spark.rpc.message.maxSize). > when topics num is large(set topic num k=200 is ok, but set k=300 failed), > and vocab size is large(nearly 1000,000) too. this problem will appear. > so i found word2vec's save function is similar to the LocalLDAModel's save > function : > word2vec's problem (use repartition(1) to save) has been fixed > [https://github.com/apache/spark/pull/9989,], but LocalLDAModel still use: > repartition(1). use single partition when save. 
> word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's LocalLDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > refer to word2vec's save (repartition(nPartitions)), i replace numWords to > topic K, repartition(nPartitions) in the LocalLDAModel's save method, > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
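The word2vec heuristic quoted in the comment above is plain arithmetic, so the proposed fix is easy to sanity-check. A minimal sketch follows (in Python rather than the Scala of the actual code; the vocabulary size, topic count, and buffer size below are hypothetical illustration values, not Spark defaults):

```python
def n_partitions(vector_size: int, num_rows: int, buffer_size: int) -> int:
    """Mirror of the word2vec save heuristic: estimate the serialized size
    (roughly 4 bytes per float plus a small per-row overhead) and split the
    DataFrame so each partition stays under the serializer buffer size."""
    approx_size = (4 * vector_size + 15) * num_rows
    return approx_size // buffer_size + 1

# For LocalLDAModel, each row is one topic vector, so num_rows would be the
# topic count k and vector_size the vocabulary size (hypothetical values):
vocab_size, k = 1_000_000, 300
buffer = 64 * 1024 * 1024  # e.g. a 64 MB serializer buffer
print(n_partitions(vocab_size, k, buffer))
```

With these numbers the topics matrix is roughly 1.2 GB, far over a single serializer buffer, which is why repartition(1) fails while the size-based partition count succeeds.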
[jira] [Commented] (SPARK-20782) Dataset's isCached operator
[ https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015810#comment-16015810 ] Takeshi Yamamuro commented on SPARK-20782: -- This short-cut makes some sense to me. WDYT? cc: [~smilegator][~cloud_fan] > Dataset's isCached operator > --- > > Key: SPARK-20782 > URL: https://issues.apache.org/jira/browse/SPARK-20782 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > > It'd be very convenient to have {{isCached}} operator that would say whether > a query is cached in-memory or not. > It'd be as simple as the following snippet: > {code} > // val q2: DataFrame > spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20800) Allow users to set job group when connecting through the SQL thrift server
[ https://issues.apache.org/jira/browse/SPARK-20800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Zeyl updated SPARK-20800: - Description: It would be useful for users to be able to set the job group through thrift server clients like beeline so that jobs in the event log could be grouped together logically. This would help in tracking the performance of repeated runs of similar sql queries, which could be tagged by the user with the same job group id. Currently each sql query, and corresponding job, is assigned a random UUID as the job group. Ideally users could set the job group in two ways: 1. by issuing a sql command prior to their query (for example, SET spark.sql.thriftserver.jobGroupID=jobA) 2. by passing a hive conf parameter through beeline to set the job group for the session. Alternatively, if people think the job group needs to be a random UUID for each sql query, introducing another parameter that could be written into the job properties field of the event log would be helpful for tracking the performance of repeated runs. was: It would be useful for users to be able to set the job group through thrift server clients like beeline so that jobs in the event log could be grouped together logically. This would help in tracking the performance of repeated runs of similar sql queries, which could be tagged by the user with the same job group id. Currently each sql query, and corresponding job, is assigned a random UUID as the job group. Ideally users could set the job group in two ways: 1. by issuing a sql command prior to their query (for example, SET spark.sql.thriftserver.jobGroupID=jobA;) 2. by passing a hive conf parameter through beeline to set the job group for the session. Alternatively, if people think the job group needs to be a random UUID for each sql query, introducing another parameter that could be written into the job properties field of the event log would be helpful for tracking the performance of repeated runs. 
> Allow users to set job group when connecting through the SQL thrift server > -- > > Key: SPARK-20800 > URL: https://issues.apache.org/jira/browse/SPARK-20800 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tim Zeyl >Priority: Minor > > It would be useful for users to be able to set the job group through thrift > server clients like beeline so that jobs in the event log could be grouped > together logically. This would help in tracking the performance of repeated > runs of similar sql queries, which could be tagged by the user with the same > job group id. Currently each sql query, and corresponding job, is assigned a > random UUID as the job group. > Ideally users could set the job group in two ways: > 1. by issuing a sql command prior to their query (for example, SET > spark.sql.thriftserver.jobGroupID=jobA) > 2. by passing a hive conf parameter through beeline to set the job group for > the session. > Alternatively, if people think the job group needs to be a random UUID for > each sql query, introducing another parameter that could be written into the > job properties field of the event log would be helpful for tracking the > performance of repeated runs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20747) Distinct in Aggregate Functions
[ https://issues.apache.org/jira/browse/SPARK-20747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015776#comment-16015776 ] Takeshi Yamamuro commented on SPARK-20747: -- Do you mean the query below? {code} scala> Seq((1, 1), (1, 1), (1, 1)).toDF("a", "b").createOrReplaceTempView("t") scala> sql("""select a, avg(distinct b) from t group by a""").show +---+---+ | a|avg(DISTINCT b)| +---+---+ | 1|1.0| +---+---+ {code} It seems these syntaxes are already supported, though; am I missing something? > Distinct in Aggregate Functions > --- > > Key: SPARK-20747 > URL: https://issues.apache.org/jira/browse/SPARK-20747 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > > {noformat} > AVG ([DISTINCT]|[ALL] ) > MAX ([DISTINCT]|[ALL] ) > MIN ([DISTINCT]|[ALL] ) > SUM ([DISTINCT]|[ALL] ) > {noformat} > Except COUNT, the DISTINCT clause is not supported by Spark SQL -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
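For reference, the semantics under discussion (AVG over the distinct values of a column within each group) can be sketched in plain Python. This only illustrates what the SQL computes, not how Spark implements it:

```python
from collections import defaultdict

def avg_distinct(rows):
    """Group (a, b) rows by a and average the *distinct* b values per group,
    mirroring SELECT a, AVG(DISTINCT b) FROM t GROUP BY a."""
    groups = defaultdict(set)  # a set per group deduplicates b
    for a, b in rows:
        groups[a].add(b)
    return {a: sum(bs) / len(bs) for a, bs in groups.items()}

# The rows from the comment above: three copies of (1, 1)
print(avg_distinct([(1, 1), (1, 1), (1, 1)]))
```

The three duplicate rows collapse to a single distinct value per group, so the average is 1.0, matching the query output shown in the comment.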
[jira] [Created] (SPARK-20800) Allow users to set job group when connecting through the SQL thrift server
Tim Zeyl created SPARK-20800: Summary: Allow users to set job group when connecting through the SQL thrift server Key: SPARK-20800 URL: https://issues.apache.org/jira/browse/SPARK-20800 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Tim Zeyl Priority: Minor It would be useful for users to be able to set the job group through thrift server clients like beeline so that jobs in the event log could be grouped together logically. This would help in tracking the performance of repeated runs of similar sql queries, which could be tagged by the user with the same job group id. Currently each sql query, and corresponding job, is assigned a random UUID as the job group. Ideally users could set the job group in two ways: 1. by issuing a sql command prior to their query (for example, SET spark.sql.thriftserver.jobGroupID=jobA;) 2. by passing a hive conf parameter through beeline to set the job group for the session. Alternatively, if people think the job group needs to be a random UUID for each sql query, introducing another parameter that could be written into the job properties field of the event log would be helpful for tracking the performance of repeated runs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ] Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM: - I'm seeing a regression from this change, the last `del <> 'hi'` filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} was (Author: masterddt): I'm seeing a regression from this change, the last {del <> 'hi'} filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} > Dataset.dropDuplicates (i.e. 
distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ] Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM: - I'm seeing a regression from this change, the last "del <> 'hi'" filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} was (Author: masterddt): I'm seeing a regression from this change, the last `del <> 'hi'` filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} > Dataset.dropDuplicates (i.e. 
distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ] Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM: - I'm seeing a regression from this change, the last {del <> 'hi'} filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} was (Author: masterddt): I'm seeing a regression from this change, the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} > Dataset.dropDuplicates (i.e. 
distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ] Mitesh edited comment on SPARK-17867 at 5/18/17 1:49 PM: - I'm seeing a regression from this change, the last "del <> 'hi'" filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} was (Author: masterddt): I'm seeing a regression from this change, the last "del <> 'hi'" filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} > Dataset.dropDuplicates (i.e. 
distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ] Mitesh commented on SPARK-17867: I'm seeing a regression from this change, the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:scala} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} > Dataset.dropDuplicates (i.e. distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name
[ https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772 ] Mitesh edited comment on SPARK-17867 at 5/18/17 1:47 PM: - I'm seeing a regression from this change, the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:none} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} was (Author: masterddt): I'm seeing a regression from this change, the last filter gets pushed down past the dropDuplicates aggregation. cc [~cloud_fan] {code:scala} val df = Seq((1,2,3,"hi"), (1,2,4,"hi")) .toDF("userid", "eventid", "vk", "del") .filter("userid is not null and eventid is not null and vk is not null") .repartitionByColumns(Seq("userid")) .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk")) .dropDuplicates("eventid") .filter("userid is not null") .repartitionByColumns(Seq("userid")). sortWithinPartitions(asc("userid")) .filter("del <> 'hi'") // filter should not be pushed down to the local table scan df.queryExecution.sparkPlan.collect { case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) => assert(false, s"$f was pushed down to $t") {code} > Dataset.dropDuplicates (i.e. 
distinct) should consider the columns with same > column name > > > Key: SPARK-17867 > URL: https://issues.apache.org/jira/browse/SPARK-17867 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We find and get the first resolved attribute from output with the given > column name in Dataset.dropDuplicates. When we have the more than one columns > with the same name. Other columns are put into aggregation columns, instead > of grouping columns. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
[ https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ala Luszczak updated SPARK-20798: - Description: GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption that one should first make sure the value is not null before calling the getter. This can lead to errors. An example of generated code: {noformat} /* 059 */ final UTF8String fieldName = value.getUTF8String(0); /* 060 */ if (value.isNullAt(0)) { /* 061 */ rowWriter1.setNullAt(0); /* 062 */ } else { /* 063 */ rowWriter1.write(0, fieldName); /* 064 */ } {noformat} was: GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption that one should first make sure the value is not null before calling the getter. An example of generated code: {noformat} /* 059 */ final UTF8String fieldName = value.getUTF8String(0); /* 060 */ if (value.isNullAt(0)) { /* 061 */ rowWriter1.setNullAt(0); /* 062 */ } else { /* 063 */ rowWriter1.write(0, fieldName); /* 064 */ } {noformat} > GenerateUnsafeProjection should check if value is null before calling the > getter > > > Key: SPARK-20798 > URL: https://issues.apache.org/jira/browse/SPARK-20798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ala Luszczak > > GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption > that one should first make sure the value is not null before calling the > getter. This can lead to errors. > An example of generated code: > {noformat} > /* 059 */ final UTF8String fieldName = value.getUTF8String(0); > /* 060 */ if (value.isNullAt(0)) { > /* 061 */ rowWriter1.setNullAt(0); > /* 062 */ } else { > /* 063 */ rowWriter1.write(0, fieldName); > /* 064 */ } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
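The generated-code bug described above is an ordering problem: the getter runs unconditionally before the null check. A language-neutral sketch (Python here; the actual generated code is Java, and the Row class below is a hypothetical stand-in for an InternalRow whose getters may not be called on null slots):

```python
class Row:
    """Hypothetical row type: calling a getter on a null slot is an error,
    modeling the 'check null before calling the getter' contract."""
    def __init__(self, values):
        self.values = values

    def is_null_at(self, i):
        return self.values[i] is None

    def get_string(self, i):
        if self.values[i] is None:
            # Models the undefined behavior of reading a null slot.
            raise ValueError(f"getter called on null slot {i}")
        return self.values[i]

def write_field_buggy(row, out):
    field = row.get_string(0)      # BUG: getter runs before the null check
    if row.is_null_at(0):
        out.append(None)
    else:
        out.append(field)

def write_field_fixed(row, out):
    if row.is_null_at(0):          # check null first, then call the getter
        out.append(None)
    else:
        out.append(row.get_string(0))

out = []
write_field_fixed(Row([None]), out)   # safe: null checked before the getter
print(out)
```

With the buggy ordering, write_field_buggy(Row([None]), []) raises before the null check ever runs, which is the class of failure the JIRA describes.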
[jira] [Assigned] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
[ https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20798: Assignee: Apache Spark > GenerateUnsafeProjection should check if value is null before calling the > getter > > > Key: SPARK-20798 > URL: https://issues.apache.org/jira/browse/SPARK-20798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ala Luszczak >Assignee: Apache Spark > > GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption > that one should first make sure the value is not null before calling the > getter. > An example of generated code: > {noformat} > /* 059 */ final UTF8String fieldName = value.getUTF8String(0); > /* 060 */ if (value.isNullAt(0)) { > /* 061 */ rowWriter1.setNullAt(0); > /* 062 */ } else { > /* 063 */ rowWriter1.write(0, fieldName); > /* 064 */ } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
[ https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20798: Assignee: (was: Apache Spark) > GenerateUnsafeProjection should check if value is null before calling the > getter > > > Key: SPARK-20798 > URL: https://issues.apache.org/jira/browse/SPARK-20798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ala Luszczak > > GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption > that one should first make sure the value is not null before calling the > getter. > An example of generated code: > {noformat} > /* 059 */ final UTF8String fieldName = value.getUTF8String(0); > /* 060 */ if (value.isNullAt(0)) { > /* 061 */ rowWriter1.setNullAt(0); > /* 062 */ } else { > /* 063 */ rowWriter1.write(0, fieldName); > /* 064 */ } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
[ https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015754#comment-16015754 ] Apache Spark commented on SPARK-20798: -- User 'ala' has created a pull request for this issue: https://github.com/apache/spark/pull/18030 > GenerateUnsafeProjection should check if value is null before calling the > getter > > > Key: SPARK-20798 > URL: https://issues.apache.org/jira/browse/SPARK-20798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ala Luszczak > > GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption > that one should first make sure the value is not null before calling the > getter. > An example of generated code: > {noformat} > /* 059 */ final UTF8String fieldName = value.getUTF8String(0); > /* 060 */ if (value.isNullAt(0)) { > /* 061 */ rowWriter1.setNullAt(0); > /* 062 */ } else { > /* 063 */ rowWriter1.write(0, fieldName); > /* 064 */ } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3
Jork Zijlstra created SPARK-20799: - Summary: Unable to infer schema for ORC on reading ORC from S3 Key: SPARK-20799 URL: https://issues.apache.org/jira/browse/SPARK-20799 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.1 Reporter: Jork Zijlstra We are getting the following exception: {code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.{code} Combining the following factors causes it: - Use S3 - Use the ORC format - Don't apply partitioning to the data - Embed AWS credentials in the path The problem is in PartitioningAwareFileIndex's def allFiles() {code} leafDirToChildrenFiles.get(qualifiedPath) .orElse { leafFiles.get(qualifiedPath).map(Array(_)) } .getOrElse(Array.empty) {code} leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while qualifiedPath contains the path WITH credentials. So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred. Spark does log the warning S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future., but that should not mean it stops working. Workaround: move the AWS credentials from the path to the SparkSession {code} SparkSession.builder .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId}) .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey}) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
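The root cause described above is a key mismatch: one side of the lookup carries the embedded credentials, the other does not. A small sketch of the kind of normalization that would make both sides key identically (Python standard library only; the real code is Scala inside PartitioningAwareFileIndex, and the bucket/path below are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_userinfo(uri: str) -> str:
    """Drop embedded credentials (user:pass@) from a URI so it can serve
    as a consistent dictionary key."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

with_creds = "s3n://ACCESS_KEY:SECRET@bucket/data/part-00000.orc"
without = "s3n://bucket/data/part-00000.orc"

# The index is keyed without credentials, as in the JIRA description;
# normalizing the qualified path before the lookup makes it hit.
index = {without: ["part-00000.orc"]}
print(index.get(strip_userinfo(with_creds)))
```

This is only an illustration of the mismatch; the workaround in the report (configuring the credentials on the SparkSession instead of the path) avoids the problem without any code change.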
[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20168: Assignee: Apache Spark > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma >Assignee: Apache Spark > Labels: kinesis, streaming > > Kinesis client can resume from a specified timestamp while creating a stream. > We should have option to pass a timestamp in config to allow kinesis to > resume from the given timestamp. > Have started initial work and will be posting a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20168: Assignee: (was: Apache Spark) > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, streaming > > Kinesis client can resume from a specified timestamp while creating a stream. > We should have option to pass a timestamp in config to allow kinesis to > resume from the given timestamp. > Have started initial work and will be posting a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015706#comment-16015706 ] Apache Spark commented on SPARK-20168: -- User 'yssharma' has created a pull request for this issue: https://github.com/apache/spark/pull/18029 > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Yash Sharma > Labels: kinesis, streaming > > Kinesis client can resume from a specified timestamp while creating a stream. > We should have option to pass a timestamp in config to allow kinesis to > resume from the given timestamp. > Have started initial work and will be posting a PR after I test the patch - > https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec
[ https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015693#comment-16015693 ] Rajdeep Mondal commented on SPARK-14864: Sorry to bother. Any progress on this? > [MLLIB] Implement Doc2Vec > - > > Key: SPARK-14864 > URL: https://issues.apache.org/jira/browse/SPARK-14864 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Peter Mountanos >Priority: Minor > > It would be useful to implement Doc2Vec, as described in the paper > [Distributed Representations of Sentences and > Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has > an implementation [Deep learning with > paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. > Le & Mikolov show that when aggregating Word2Vec vector representations for a > paragraph/document, it does not perform well for prediction tasks. Instead, > they propose the Paragraph Vector implementation, which provides > state-of-the-art results on several text classification and sentiment > analysis tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20797) mllib lda load and save out of memory.
[ https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015657#comment-16015657 ] Sean Owen commented on SPARK-20797: --- It's not clear what you're describing here. Can you reduce this to focus on the specific problem and change? How many topics? > mllib lda load and save out of memory. > --- > > Key: SPARK-20797 > URL: https://issues.apache.org/jira/browse/SPARK-20797 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1 >Reporter: d0evi1 > > when i try online lda model with large text data, the training step went > well, but the save step failed. but something like below happened (etc. > 1.6.1): > 1.bigger than spark.kryoserializer.buffer.max. (turning bigger the param can > fixed), > 2. exceed spark.akka.frameSize. (turning this param too bigger will fail, > version > 2.0.0, exceeds max allowed: spark.rpc.message.maxSize). > when topics num is large, and vocab size is large too. this problem will > appear. 
> so i found this: > https://github.com/apache/spark/pull/9989, word2vec's problem has been fixed, > this is word2vec's save method from latest code: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala: > val approxSize = (4L * vectorSize + 15) * numWords > val nPartitions = ((approxSize / bufferSize) + 1).toInt > val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } > > spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) > but the code in mllib.clustering.LDAModel's save: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala > you'll see: > val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix > val topics = Range(0, k).map { topicInd => > Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), > topicInd) > } > > spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) > i try word2vec's save, replace numWords to topic K, repartition(nPartitions), > recompile the code, deploy the new lda's project with large data on our > machine cluster, it works. > hopes it will fixed in the next version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
[ https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ala Luszczak updated SPARK-20798: - Description: GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption that one should first make sure the value is not null before calling the getter. An example of generated code: {noformat} /* 059 */ final UTF8String fieldName = value.getUTF8String(0); /* 060 */ if (value.isNullAt(0)) { /* 061 */ rowWriter1.setNullAt(0); /* 062 */ } else { /* 063 */ rowWriter1.write(0, fieldName); /* 064 */ } {noformat} > GenerateUnsafeProjection should check if value is null before calling the > getter > > > Key: SPARK-20798 > URL: https://issues.apache.org/jira/browse/SPARK-20798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ala Luszczak > > GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption > that one should first make sure the value is not null before calling the > getter. > An example of generated code: > {noformat} > /* 059 */ final UTF8String fieldName = value.getUTF8String(0); > /* 060 */ if (value.isNullAt(0)) { > /* 061 */ rowWriter1.setNullAt(0); > /* 062 */ } else { > /* 063 */ rowWriter1.write(0, fieldName); > /* 064 */ } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
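The fix the title asks for is an ordering change in the generated code: test isNullAt before invoking the getter. A minimal Python stand-in (FakeRow is a hypothetical illustration, not Spark's UnsafeRow) shows why the reported order can fail on a null slot:

```python
# Hypothetical miniature of the ordering bug: the generated code calls the
# getter before checking for null. FakeRow stands in for Spark's UnsafeRow;
# it is an illustration, not the real class.
class FakeRow:
    def __init__(self, values):
        self.values = values

    def is_null_at(self, i):
        return self.values[i] is None

    def get_string(self, i):
        # Like UnsafeRow getters, this must not be called on a null slot.
        if self.values[i] is None:
            raise ValueError(f"getter called on null slot {i}")
        return self.values[i]

def write_buggy(row):
    """Getter first, null check second: the order in the reported codegen."""
    try:
        field = row.get_string(0)  # may blow up on a null slot
        return "null" if row.is_null_at(0) else field
    except ValueError:
        return "crash"

def write_fixed(row):
    """Null check before the getter, as the issue title asks."""
    return "null" if row.is_null_at(0) else row.get_string(0)
```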
[jira] [Updated] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20796: -- Priority: Trivial (was: Major) [~liuzhaokun] please don't open a JIRA for these. They're trivial. They are not "major improvements" as you've tagged it. Read http://spark.apache.org/contributing.html > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Priority: Trivial > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter
Ala Luszczak created SPARK-20798: Summary: GenerateUnsafeProjection should check if value is null before calling the getter Key: SPARK-20798 URL: https://issues.apache.org/jira/browse/SPARK-20798 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Ala Luszczak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20797) mllib lda load and save out of memory.
d0evi1 created SPARK-20797: -- Summary: mllib lda load and save out of memory. Key: SPARK-20797 URL: https://issues.apache.org/jira/browse/SPARK-20797 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.1.1, 2.0.2, 2.0.0, 1.6.3, 1.6.1 Reporter: d0evi1 When I train an online LDA model on large text data, the training step goes well, but the save step fails with errors like the following (e.g. on 1.6.1): 1. bigger than spark.kryoserializer.buffer.max (raising that param can fix it); 2. exceeds spark.akka.frameSize (raising this param too far will also fail; on versions > 2.0.0 the message is "exceeds max allowed: spark.rpc.message.maxSize"). The problem appears when the number of topics is large and the vocab size is large too. So I found https://github.com/apache/spark/pull/9989: Word2Vec's version of this problem has been fixed. This is Word2Vec's save method from the latest code (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala): val approxSize = (4L * vectorSize + 15) * numWords val nPartitions = ((approxSize / bufferSize) + 1).toInt val dataArray = model.toSeq.map { case (w, v) => Data(w, v) } spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path)) But in mllib.clustering.LDAModel's save (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala) you'll see: val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix val topics = Range(0, k).map { topicInd => Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd) } spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path)) I tried Word2Vec's approach in the LDA save (replace numWords with the topic count k, use repartition(nPartitions)), recompiled the code, and deployed the patched LDA project with large data on our cluster; it works. Hopefully this will be fixed in the next version. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
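The sizing heuristic the reporter borrowed from Word2Vec's save comes down to one formula. A hedged sketch (names are illustrative; buffer_size_bytes plays the role of the serializer buffer limit): derive the partition count from the estimated matrix size instead of hard-coding repartition(1).

```python
# Hedged sketch of the Word2Vec-style sizing heuristic, adapted to the LDA
# topics matrix. Names are illustrative, not Spark's actual API.
def num_save_partitions(vocab_size: int, num_topics: int, buffer_size_bytes: int) -> int:
    # Each topic column stores vocab_size doubles (8 bytes each) plus a
    # small per-row overhead, mirroring Word2Vec's approxSize formula.
    approx_size_bytes = (8 * vocab_size + 15) * num_topics
    # One extra partition so the remainder never overflows the buffer.
    return approx_size_bytes // buffer_size_bytes + 1
```

With a 1,000,000-term vocabulary, 1,000 topics, and a 64 MiB buffer this yields 120 partitions, so no single write task has to serialize the whole topics matrix.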
[jira] [Assigned] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20796: Assignee: Apache Spark > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: Apache Spark > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20796: Assignee: (was: Apache Spark) > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
[ https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015631#comment-16015631 ] Apache Spark commented on SPARK-20796: -- User 'liu-zhaokun' has created a pull request for this issue: https://github.com/apache/spark/pull/18027 > the location of start-master.sh in spark-standalone.md is wrong > --- > > Key: SPARK-20796 > URL: https://issues.apache.org/jira/browse/SPARK-20796 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.1.1 >Reporter: liuzhaokun > > the location of start-master.sh in spark-standalone.md should be > "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong
liuzhaokun created SPARK-20796: -- Summary: the location of start-master.sh in spark-standalone.md is wrong Key: SPARK-20796 URL: https://issues.apache.org/jira/browse/SPARK-20796 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.1.1 Reporter: liuzhaokun the location of start-master.sh in spark-standalone.md should be "sbin/start-master.sh" rather than "bin/start-master.sh". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015591#comment-16015591 ] Aaquib Khwaja commented on SPARK-19275: --- Hi [~dmitry_iii], I also ran into a similar issue. I've set the value of 'spark.streaming.kafka.consumer.poll.ms' as 6, but i'm still running into issues. Here is the stack trace and other details: http://stackoverflow.com/questions/44045323/sparkstreamingkafka-failed-to-get-records-after-polling-for-6 Also, below are some relevant configs: batch.interval = 60s spark.streaming.kafka.consumer.poll.ms = 6 session.timeout.ms = 6 (default: 3) heartbeat.interval.ms = 6000 (default: 3000) request.timeout.ms = 9 (default: 4) Any help would be great ! Thanks. > Spark Streaming, Kafka receiver, "Failed to get records for ... after polling > for 512" > -- > > Key: SPARK-19275 > URL: https://issues.apache.org/jira/browse/SPARK-19275 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 >Reporter: Dmitry Ochnev > > We have a Spark Streaming application reading records from Kafka 0.10. > Some tasks are failed because of the following error: > "java.lang.AssertionError: assertion failed: Failed to get records for (...) > after polling for 512" > The first attempt fails and the second attempt (retry) completes > successfully, - this is the pattern that we see for many tasks in our logs. > These fails and retries consume resources. 
> A similar case with a stack trace are described here: > https://www.mail-archive.com/user@spark.apache.org/msg56564.html > https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767 > Here is the line from the stack trace where the error is raised: > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > We tried several values for "spark.streaming.kafka.consumer.poll.ms", - 2, 5, > 10, 30 and 60 seconds, but the error appeared in all the cases except the > last one. Moreover, increasing the threshold led to increasing total Spark > stage duration. > In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to > fewer task failures but with cost of total stage duration. So, it is bad for > performance when processing data streams. > We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other > related classes) which inhibits the reading process. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20795) Maximum document frequency for IDF
[ https://issues.apache.org/jira/browse/SPARK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20795. --- Resolution: Invalid Please start this as a question on the mailing list. > Maximum document frequency for IDF > -- > > Key: SPARK-20795 > URL: https://issues.apache.org/jira/browse/SPARK-20795 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Turan Gojayev >Priority: Minor > > In current implementation of IDF there is no way for setting maximum number > of documents for filtering the terms. I assume that the functionality is the > same for minimum document frequency, and was wondering, if there is a special > reason for not having maxDocFreq parameter and filtering. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20795) Maximum document frequency for IDF
[ https://issues.apache.org/jira/browse/SPARK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015580#comment-16015580 ] Turan Gojayev commented on SPARK-20795: --- I am a total newbie here, so excuse me if I've set anything wrong > Maximum document frequency for IDF > -- > > Key: SPARK-20795 > URL: https://issues.apache.org/jira/browse/SPARK-20795 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Turan Gojayev >Priority: Minor > > In current implementation of IDF there is no way for setting maximum number > of documents for filtering the terms. I assume that the functionality is the > same for minimum document frequency, and was wondering, if there is a special > reason for not having maxDocFreq parameter and filtering. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20795) Maximum document frequency for IDF
[ https://issues.apache.org/jira/browse/SPARK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Turan Gojayev updated SPARK-20795: -- Description: In current implementation of IDF there is no way for setting maximum number of documents for filtering the terms. I assume that the functionality is the same for minimum document frequency, and was wondering, if there is a special reason for not having maxDocFreq parameter and filtering. > Maximum document frequency for IDF > -- > > Key: SPARK-20795 > URL: https://issues.apache.org/jira/browse/SPARK-20795 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Turan Gojayev >Priority: Minor > > In current implementation of IDF there is no way for setting maximum number > of documents for filtering the terms. I assume that the functionality is the > same for minimum document frequency, and was wondering, if there is a special > reason for not having maxDocFreq parameter and filtering. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
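The requested knob is easy to state precisely: keep only terms whose document frequency lies within [min_df, max_df], where max_df is the missing upper bound the reporter asks about. A small sketch (illustrative only; this is not Spark's IDF API):

```python
# Sketch of document-frequency filtering with both a lower and an upper
# bound. Illustration of the requested behavior, not Spark's IDF class.
from collections import Counter

def df_filter(docs, min_df=1, max_df=None):
    df = Counter()
    for doc in docs:
        for term in set(doc):  # count each term at most once per document
            df[term] += 1
    hi = max_df if max_df is not None else len(docs)
    return {term for term, count in df.items() if min_df <= count <= hi}
```

An upper bound like this is what drops ubiquitous, uninformative terms before IDF weighting, complementing the existing minimum-frequency filter.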
[jira] [Created] (SPARK-20795) Maximum document frequency for IDF
Turan Gojayev created SPARK-20795: - Summary: Maximum document frequency for IDF Key: SPARK-20795 URL: https://issues.apache.org/jira/browse/SPARK-20795 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.1.0 Reporter: Turan Gojayev Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20784) Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() in YARN client mode
[ https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mathieu D updated SPARK-20784: -- Affects Version/s: 2.1.1 Description: Spark hangs and stops executing any job or task (v2.0.2). Web UI shows *0 active stages* and *0 active tasks* on executors, although a driver thread is clearly working/finishing a stage (see below). Our application runs several Spark contexts for several users in parallel in threads. Spark version 2.0.2, yarn-client. Extract of thread stack below. {noformat} "ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x7fddf0005800 nid=0x484 waiting on condition [0x7fddd0bf6000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00078c232760> (a scala.concurrent.impl.Promise$CompletionLatch) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212) at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120) at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30) at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) at org.apache.spark.sql.execution.ProjectExe
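The "Futures timed out" variant of this hang in v2.1 corresponds to the driver blocking on a broadcast result that never arrives, as the tryAwait frames in the stack above suggest. A tiny stand-in using Python futures (an illustration, not Spark's ThreadUtils.awaitResult):

```python
# Stand-in for a driver thread waiting on a result that never materializes.
# Illustration only, not Spark's ThreadUtils.awaitResult.
from concurrent.futures import Future, TimeoutError as FutureTimeout

f = Future()  # never completed, like the stuck broadcast exchange above
try:
    f.result(timeout=0.1)  # driver-side wait with a deadline
    outcome = "completed"
except FutureTimeout:
    outcome = "Futures timed out"
```

With no deadline at all (as in the v2.0 stack trace, which parks on a CompletionLatch), the same wait simply hangs forever, which matches the "0 active stages / 0 active tasks" symptom.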
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015455#comment-16015455 ] Nick Pentreath commented on SPARK-20768: Sure - though perhaps [~yuhaoyan] can give an opinion whether it should be added as an explicit {{Param}}? > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015448#comment-16015448 ] Yan Facai (颜发才) edited comment on SPARK-20768 at 5/18/17 8:59 AM: -- It seems easy, I can work on it. However, I'm on holiday this weekend. Is it OK to wait one week? was (Author: facai): It seems easy, I can work on it. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015448#comment-16015448 ] Yan Facai (颜发才) commented on SPARK-20768: - It seems easy, I can work on it. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015440#comment-16015440 ] Nick Pentreath commented on SPARK-20768: It is there - but not documented as a {{Param}} and so doesn't show up in API doc, also there is no {{set}} method for it. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015437#comment-16015437 ] Yan Facai (颜发才) commented on SPARK-20768:

Hi, I'm a newbie. `numPartitions` is found in the pyspark code, could you explain more details? thanks.

```python
def __init__(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
             predictionCol="prediction", numPartitions=None):
```

> PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16202) Misleading Description of CreatableRelationProvider's createRelation
[ https://issues.apache.org/jira/browse/SPARK-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015386#comment-16015386 ] Apache Spark commented on SPARK-16202: -- User 'jaceklaskowski' has created a pull request for this issue: https://github.com/apache/spark/pull/18026 > Misleading Description of CreatableRelationProvider's createRelation > > > Key: SPARK-16202 > URL: https://issues.apache.org/jira/browse/SPARK-16202 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > Fix For: 2.1.0 > > > The API description of {{createRelation}} in {{CreatableRelationProvider}} is > misleading. The current description only expects users to return the > relation. However, the major goal of this API should also include saving the > Dataframe. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20506) ML, Graph 2.2 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015382#comment-16015382 ] Nick Pentreath commented on SPARK-20506: Oh also SPARK-14503 is important > ML, Graph 2.2 QA: Programming guide update and migration guide > -- > > Key: SPARK-20506 > URL: https://issues.apache.org/jira/browse/SPARK-20506 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Critical > > Before the release, we need to update the MLlib and GraphX Programming > Guides. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20506) ML, Graph 2.2 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015362#comment-16015362 ] Nick Pentreath commented on SPARK-20506: Cool - I've added a section before the Migration Guide in the linked PR. Any other suggestions for items to include please say here or on the PR. Thanks! > ML, Graph 2.2 QA: Programming guide update and migration guide > -- > > Key: SPARK-20506 > URL: https://issues.apache.org/jira/browse/SPARK-20506 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath >Priority: Critical > > Before the release, we need to update the MLlib and GraphX Programming > Guides. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-17692]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015345#comment-16015345 ] Yan Facai (颜发才) commented on SPARK-19581:

[~barrybecker4] Hi, Becker. I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7.

1) For a feature vector of size 0, the exception is harmless:

```scala
val data = spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache

import org.apache.spark.ml.classification.NaiveBayes
val model = new NaiveBayes().fit(data)

import org.apache.spark.ml.linalg.{Vectors => SV}
case class TestData(features: org.apache.spark.ml.linalg.Vector)
val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double])
val test = Seq(TestData(emptyVector)).toDF

scala> test.show
+---------+
| features|
+---------+
|(0,[],[])|
+---------+

scala> model.transform(test).show
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
  ... 48 elided
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 692, x: 0
  at scala.Predef$.require(Predef.scala:224)
  ... 99 more
```

2) For an empty feature vector of size 692, it's OK.
```scala
scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double])
emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[])

scala> val test = Seq(TestData(emptyVector)).toDF
test: org.apache.spark.sql.DataFrame = [features: vector]

scala> test.show
+-----------+
|   features|
+-----------+
|(692,[],[])|
+-----------+

scala> model.transform(test).show
+-----------+--------------------+--------------------+----------+
|   features|       rawPrediction|         probability|prediction|
+-----------+--------------------+--------------------+----------+
|(692,[],[])|[-0.8407831793660...|[0.43137254901960...|       1.0|
+-----------+--------------------+--------------------+----------+
```

> running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are some cases, where all the features may get removed. When this > happens, executors will crash and get restarted (if on a cluster) or spark > will crash and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. I emits this vague error : > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. 
> My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier). > Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event > SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29) > [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus []
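[Editor's note] The DGEMV crash reported above and the catchable {{IllegalArgumentException}} in the 2.1.1 repro differ in exactly one way: whether dimensions are validated in managed code before handing the buffers to native BLAS. A minimal sketch of that kind of pre-call guard follows (plain Python with a hypothetical {{safe_gemv}} helper; this is not Spark's actual implementation):

```python
# Guarding a GEMV-style product (y = A @ x) before calling into a native
# routine: validate dimensions up front so a bad input raises a catchable
# error instead of letting BLAS print "parameter number 6 had an illegal
# value" and terminate the process. Hypothetical helper, not Spark code.

def safe_gemv(A, x):
    """Multiply a dense matrix (list of rows) by a vector, with the
    dimension check that the native routine would not do safely."""
    ncols = len(A[0]) if A else 0
    if len(x) != ncols:
        raise ValueError(
            "The columns of A don't match the number of elements of x. "
            "A: %d, x: %d" % (ncols, len(x)))
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[1.0, 2.0], [3.0, 4.0]]
print(safe_gemv(A, [1.0, 1.0]))  # -> [3.0, 7.0]

# A zero-length vector (the "0 features" case) now fails loudly but safely:
try:
    safe_gemv(A, [])
except ValueError as e:
    print(e)
```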
[jira] [Resolved] (SPARK-20794) Spark show() command on dataset does not retrieve consistent rows from DASHDB data source
[ https://issues.apache.org/jira/browse/SPARK-20794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20794. --- Resolution: Invalid It's a question, so belongs on the mailing list. I think it's a DASHDB question. show is just picking from the first partition of the underlying data source. > Spark show() command on dataset does not retrieve consistent rows from DASHDB > data source > - > > Key: SPARK-20794 > URL: https://issues.apache.org/jira/browse/SPARK-20794 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Sahana HA >Priority: Minor > > When the user creates the dataframe from DASHDB data source (which is a > relational database) and executes df.show(5) it returns different result sets > or rows during each execution. We are aware that show(5) will pick the first > 5 rows from existing partition and hence it is not guaranteed to be > consistent across each execution. > However when we try the same show(5) command against S3 storage or > bluemixobject store (non-relational data source) we always get the same > result sets or rows in order, across each execution. > We just wanted to confirm why the difference between DASHDB and other data > source like S3/Bluemixobjectstore ? Is the issue with spark or DASHDB alone ? > or is the inconsistent rows behavior is there for all relational data source ? 
> Repro snippet:
> -- Load the data from dashdb
> val dashdb = sqlContext.read.format("packageName").options(dashdbreadOptions).load
> -- execution #1
> dashdb.show(5)
> +--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
> |        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|   CITY|STATE|      COUNTRY|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
> +--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
> |Personal Accessories|     Eyewear|           107861|Rutland|   VT|United States|     F| 39|       Married|       Sales|
> |   Camping Equipment|    Lanterns|           189003| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
> |   Camping Equipment|Cooking Gear|           107863| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
> |Personal Accessories|     Eyewear|           189005|Villach|   NA|      Austria|     F| 37|       Married|Professional|
> |Personal Accessories|     Eyewear|           107865|Villach|   NA|      Austria|     F| 37|       Married|Professional|
> +--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
> only showing top 5 rows
> -- execution #2
> dashdb.show(5)
> +--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
> |        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|        CITY|STATE|       COUNTRY|GENDER|AGE|MARITAL_STATUS| PROFESSION|
> +--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
> |Mountaineering Eq...|       Tools|           112835|  Portsmouth|   NA|United Kingdom|     M| 24|        Single|      Other|
> |   Camping Equipment|Cooking Gear|           193902|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
> |   Camping Equipment|       Packs|           112837|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
> |Mountaineering Eq...|        Rope|           193904|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
> |      Golf Equipment|     Putters|           112839|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
> +--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
> only showing top 5 rows -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
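[Editor's note] Sean Owen's resolution — {{show(n)}} just takes rows from whichever partition comes back first — can be illustrated without Spark at all. The toy {{show_first}} helper below is hypothetical; it simply flattens partitions in arrival order, which is why "first 5" varies across runs unless an explicit sort fixes the order:

```python
# Toy model of an un-ordered show(n): concatenate partitions in whatever
# order the data source returns them and take the first n rows. A parallel
# relational scan (as against DASHDB) gives no stable partition order, so
# two runs can legitimately disagree; an explicit ORDER BY restores
# determinism. Hypothetical helper, not Spark internals.
from itertools import chain

def show_first(partitions, n=5):
    """Take the first n rows in partition-arrival order."""
    return list(chain.from_iterable(partitions))[:n]

rows = [["a", "b", "c"], ["d", "e", "f"]]

run1 = show_first(rows)                  # partitions arrive as-is
run2 = show_first(list(reversed(rows)))  # partitions arrive reversed

print(run1)  # -> ['a', 'b', 'c', 'd', 'e']
print(run2)  # -> ['d', 'e', 'f', 'a', 'b']

# Sorting (the ORDER BY equivalent) is order-independent:
assert sorted(chain.from_iterable(rows)) == \
       sorted(chain.from_iterable(reversed(rows)))
```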
[jira] [Created] (SPARK-20794) Spark show() command on dataset does not retrieve consistent rows from DASHDB data source
Sahana HA created SPARK-20794: - Summary: Spark show() command on dataset does not retrieve consistent rows from DASHDB data source Key: SPARK-20794 URL: https://issues.apache.org/jira/browse/SPARK-20794 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.0.0 Reporter: Sahana HA Priority: Minor When the user creates the dataframe from DASHDB data source (which is a relational database) and executes df.show(5) it returns different result sets or rows during each execution. We are aware that show(5) will pick the first 5 rows from existing partition and hence it is not guaranteed to be consistent across each execution. However when we try the same show(5) command against S3 storage or bluemixobject store (non-relational data source) we always get the same result sets or rows in order, across each execution. We just wanted to confirm why the difference between DASHDB and other data source like S3/Bluemixobjectstore ? Is the issue with spark or DASHDB alone ? or is the inconsistent rows behavior is there for all relational data source ? 
Repro snippet:
-- Load the data from dashdb
val dashdb = sqlContext.read.format("packageName").options(dashdbreadOptions).load
-- execution #1
dashdb.show(5)
+--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
|        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|   CITY|STATE|      COUNTRY|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
+--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
|Personal Accessories|     Eyewear|           107861|Rutland|   VT|United States|     F| 39|       Married|       Sales|
|   Camping Equipment|    Lanterns|           189003| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
|   Camping Equipment|Cooking Gear|           107863| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
|Personal Accessories|     Eyewear|           189005|Villach|   NA|      Austria|     F| 37|       Married|Professional|
|Personal Accessories|     Eyewear|           107865|Villach|   NA|      Austria|     F| 37|       Married|Professional|
+--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
only showing top 5 rows
-- execution #2
dashdb.show(5)
+--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
|        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|        CITY|STATE|       COUNTRY|GENDER|AGE|MARITAL_STATUS| PROFESSION|
+--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
|Mountaineering Eq...|       Tools|           112835|  Portsmouth|   NA|United Kingdom|     M| 24|        Single|      Other|
|   Camping Equipment|Cooking Gear|           193902|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
|   Camping Equipment|       Packs|           112837|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
|Mountaineering Eq...|        Rope|           193904|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
|      Golf Equipment|     Putters|           112839|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
+--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
only showing top 5 rows -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20793) cache table will not refresh after insert data to some broadcast table
[ https://issues.apache.org/jira/browse/SPARK-20793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] du closed SPARK-20793. -- Resolution: Not A Problem > cache table will not refresh after insert data to some broadcast table > -- > > Key: SPARK-20793 > URL: https://issues.apache.org/jira/browse/SPARK-20793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: du > > run below sql in spark-sql or beeline > create table t4(c1 int,c2 int); > insert into table t4 select 1,2; > insert into table t4 select 2,2; > create table t5(c1 int,c2 int); > insert into table t5 select 2,3; > cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 > from t4 join t5 on t4.c2=t5.c1; > cache table t6 as select * from t4 join t3 on t4.c2=t3.c2; > select * from t3; > select * from t6; > insert into table t5 select 2,4; > select * from t3; > select * from t6; > after insert table t5, t3 and t6 are not include data 2,4 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20793) cache table will not refresh after insert data to some broadcast table
[ https://issues.apache.org/jira/browse/SPARK-20793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] du updated SPARK-20793: --- Description: run below sql in spark-sql or beeline create table t4(c1 int,c2 int); insert into table t4 select 1,2; insert into table t4 select 2,2; create table t5(c1 int,c2 int); insert into table t5 select 2,3; cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 from t4 join t5 on t4.c2=t5.c1; cache table t6 as select * from t4 join t3 on t4.c2=t3.c2; select * from t3; select * from t6; insert into table t5 select 2,4; select * from t3; select * from t6; after insert table t5, t3 and t6 are not include data 2,4 was: create table t4(c1 int,c2 int); insert into table t4 select 1,2; insert into table t4 select 2,2; create table t5(c1 int,c2 int); insert into table t5 select 2,3; run below sql cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 from t4 join t5 on t4.c2=t5.c1; cache table t6 as select * from t4 join t3 on t4.c2=t3.c2; select * from t3; select * from t6; insert into table t5 select 2,4; select * from t3; select * from t6; after insert table t5, t3 and t6 are not include data 2,4 > cache table will not refresh after insert data to some broadcast table > -- > > Key: SPARK-20793 > URL: https://issues.apache.org/jira/browse/SPARK-20793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: du > > run below sql in spark-sql or beeline > create table t4(c1 int,c2 int); > insert into table t4 select 1,2; > insert into table t4 select 2,2; > create table t5(c1 int,c2 int); > insert into table t5 select 2,3; > cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 > from t4 join t5 on t4.c2=t5.c1; > cache table t6 as select * from t4 join t3 on t4.c2=t3.c2; > select * from t3; > select * from t6; > insert into table t5 select 2,4; > select * from t3; > select * from t6; > after insert table t5, t3 and t6 are not include 
data 2,4 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
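[Editor's note] The "Not A Problem" resolution above matches how CACHE TABLE ... AS SELECT behaves: it materializes the query result once, and later inserts into the base tables do not invalidate that snapshot until the cache is dropped or rebuilt. The toy {{CachedQuery}} class below is a hypothetical illustration of that semantics, not Spark's caching internals:

```python
# Toy model of CACHE TABLE ... AS SELECT: the result is materialized
# eagerly and does NOT track later changes to the base table. Re-running
# the cache step (refresh) is what picks up new rows. Hypothetical class,
# not Spark internals.

class CachedQuery:
    def __init__(self, query):
        self._query = query
        self._snapshot = query()  # evaluated once, at cache time

    def select(self):
        # SELECT * FROM cached_table: always serves the snapshot
        return self._snapshot

    def refresh(self):
        # analogous to dropping and re-creating the cached table
        self._snapshot = self._query()

t5 = [(2, 3)]                           # base table
t3 = CachedQuery(lambda: list(t5))      # CACHE TABLE t3 AS SELECT ... FROM t5

t5.append((2, 4))                       # INSERT INTO t5 SELECT 2,4
print(t3.select())  # -> [(2, 3)]  (snapshot is stale, as in the report)
t3.refresh()
print(t3.select())  # -> [(2, 3), (2, 4)]
```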