[jira] [Issue Comment Deleted] (SPARK-20785) Spark should provide jump links and add (count) in the SQL web UI.

2017-05-18 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-20785:
---
Comment: was deleted

(was: [~srowen]
Please help deal with this issue.)

> Spark should provide jump links and add (count) in the SQL web UI.
> ---
>
> Key: SPARK-20785
> URL: https://issues.apache.org/jira/browse/SPARK-20785
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> It provides links that jump to the Running Queries, Completed Queries, and 
> Failed Queries sections.
> It adds a (count) for Running Queries, Completed Queries, and Failed Queries.
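
To make the request concrete, here is a minimal Scala sketch of the kind of summary being asked for: same-page anchor links plus a count for each query table. The names and HTML ids are illustrative assumptions, not the actual SQL tab page code.

{code}
object SqlTabSummarySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical query lists standing in for the SQL tab's running/completed/failed tables.
    val (running, completed, failed) = (Seq("q1"), Seq("q2", "q3"), Seq.empty[String])
    // Jump links ("#running" etc.) plus a (count) next to each section name.
    val summary =
      s"""<ul class="unstyled">
         |  <li><a href="#running">Running Queries:</a> ${running.size}</li>
         |  <li><a href="#completed">Completed Queries:</a> ${completed.size}</li>
         |  <li><a href="#failed">Failed Queries:</a> ${failed.size}</li>
         |</ul>""".stripMargin
    println(summary)
  }
}
{code}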






[jira] [Commented] (SPARK-20785) Spark should provide jump links and add (count) in the SQL web UI.

2017-05-18 Thread guoxiaolongzte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016989#comment-16016989
 ] 

guoxiaolongzte commented on SPARK-20785:


[~srowen]
Please help deal with this issue.

> Spark should provide jump links and add (count) in the SQL web UI.
> ---
>
> Key: SPARK-20785
> URL: https://issues.apache.org/jira/browse/SPARK-20785
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
>
> It provides links that jump to the Running Queries, Completed Queries, and 
> Failed Queries sections.
> It adds a (count) for Running Queries, Completed Queries, and Failed Queries.






[jira] [Resolved] (SPARK-20806) Launcher: redundant code, invalid branch of judgment

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20806.
---
Resolution: Invalid

It's hard to understand what you're describing, but I looked at the code you 
reference and the logic is fine. The argument failIfNotFound is not always 
true or false.

> Launcher: redundant code, invalid branch of judgment
> --
>
> Key: SPARK-20806
> URL: https://issues.apache.org/jira/browse/SPARK-20806
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.1.1
>Reporter: Phoenix_Dad
>
> In org.apache.spark.launcher.CommandBuilderUtils, the findJarsDir function 
> contains an if/else branch. The first argument passed to 'checkState' in the 
> 'if' branch is always true, so that 'checkState' call is useless there.






[jira] [Created] (SPARK-20806) Launcher: redundant code, invalid branch of judgment

2017-05-18 Thread Phoenix_Dad (JIRA)
Phoenix_Dad created SPARK-20806:
---

 Summary: Launcher: redundant code, invalid branch of judgment
 Key: SPARK-20806
 URL: https://issues.apache.org/jira/browse/SPARK-20806
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Submit
Affects Versions: 2.1.1
Reporter: Phoenix_Dad


In org.apache.spark.launcher.CommandBuilderUtils, the findJarsDir function 
contains an if/else branch. The first argument passed to 'checkState' in the 
'if' branch is always true, so that 'checkState' call is useless there.
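
For readers without the Launcher source open, here is a simplified sketch of the pattern being questioned. It is transcribed into Scala and paraphrased from memory (the real CommandBuilderUtils is Java and may differ in detail); checkState here is a stand-in for the assertion helper.

{code}
object FindJarsDirSketch {
  // Stand-in for CommandBuilderUtils.checkState: fail when the condition is false.
  def checkState(check: Boolean, msg: String): Unit =
    if (!check) throw new IllegalStateException(msg)

  // Simplified shape of findJarsDir as described in the report.
  def findJarsDir(sparkHome: String, failIfNotFound: Boolean): String = {
    val jars = new java.io.File(sparkHome, "jars")
    if (jars.isDirectory) {
      // We only got here because jars.isDirectory is true, so this condition can
      // never be false -- which is the reporter's point about the 'if' branch.
      checkState(!failIfNotFound || jars.isDirectory,
        s"Library directory '${jars.getAbsolutePath}' does not exist.")
      jars.getAbsolutePath
    } else {
      // In this branch failIfNotFound does matter, which is why the argument is
      // not always true or false.
      checkState(!failIfNotFound,
        s"Library directory '${jars.getAbsolutePath}' does not exist; build Spark first.")
      null
    }
  }
}
{code}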






[jira] [Commented] (SPARK-20805) updated updateP in SVD++ is incorrect

2017-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016969#comment-16016969
 ] 

Sean Owen commented on SPARK-20805:
---

This is hard to understand [~BoLing]. Can you open a PR with the proposed 
change, and explain the effect of the problem more clearly?

> updated updateP in SVD++ is incorrect
> --
>
> Key: SPARK-20805
> URL: https://issues.apache.org/jira/browse/SPARK-20805
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1, 2.1.1
>Reporter: BoLing
>
> In the SVD++ algorithm, we know that usr._2 stores the value of pu + 
> |N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values 
> of updateP, updateQ and updateY. During the iteration cycle, only the y part 
> of usr._2 is updated, while pu is never updated. So we should fix the message 
> sent to the source vertex in sendMsgTrainF. The old code is 
> ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * 
> conf.gamma1)). If we change it to ctx.sendToSrc((updateP, updateP, (err - 
> conf.gamma6 * usr._3) * conf.gamma1)), it may achieve the effect we want.






[jira] [Created] (SPARK-20805) updated updateP in SVD++ is incorrect

2017-05-18 Thread BoLing (JIRA)
BoLing created SPARK-20805:
--

 Summary: updated updateP in SVD++ is incorrect
 Key: SPARK-20805
 URL: https://issues.apache.org/jira/browse/SPARK-20805
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.1.1, 1.6.1
Reporter: BoLing


In the SVD++ algorithm, we know that usr._2 stores the value of pu + 
|N(u)|^(-0.5)*sum(y); the function sendMsgTrainF computes the updated values of 
updateP, updateQ and updateY. During the iteration cycle, only the y part of 
usr._2 is updated, while pu is never updated. So we should fix the message sent 
to the source vertex in sendMsgTrainF. The old code is 
ctx.sendToSrc((updateP, updateY, (err - conf.gamma6 * usr._3) * conf.gamma1)). 
If we change it to ctx.sendToSrc((updateP, updateP, (err - conf.gamma6 * 
usr._3) * conf.gamma1)), it may achieve the effect we want.
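
To make the tuple positions concrete, here is a small, self-contained Scala sketch. It is not the GraphX source; the message type and the way the receiving vertex folds the message into usr._2 are simplified assumptions, but it shows why sending updateY versus updateP in the second slot determines whether the pu part of usr._2 ever moves.

{code}
object Svdpp20805Sketch {
  // Simplified stand-in for the message produced by sendMsgTrainF:
  // (pUpdate, yUpdate, biasUpdate).
  type Msg = (Array[Double], Array[Double], Double)

  // Assumed receiving side: only the second element of the message is added
  // into the usr._2 vector (which holds pu + |N(u)|^(-0.5) * sum(y)).
  def applyToUsr2(usr2: Array[Double], msg: Msg): Array[Double] =
    usr2.zip(msg._2).map { case (a, b) => a + b }

  def main(args: Array[String]): Unit = {
    val usr2    = Array(1.0, 1.0)
    val updateP = Array(0.5, 0.5)   // gradient step intended for pu
    val updateY = Array(0.1, 0.1)   // gradient step for the y part
    // Old message shape: usr._2 only ever receives updateY; the pu step never lands.
    println(applyToUsr2(usr2, (updateP, updateY, 0.0)).mkString(","))  // 1.1,1.1
    // Reporter's proposal: put updateP in the second slot so pu is updated as well.
    println(applyToUsr2(usr2, (updateP, updateP, 0.0)).mkString(","))  // 1.5,1.5
  }
}
{code}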







[jira] [Updated] (SPARK-20803) KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not

2017-05-18 Thread Bettadapura Srinath Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bettadapura Srinath Sharma updated SPARK-20803:
---
Description: 
When data is NOT normally distributed (correct behavior):
This code:
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
# Find density estimates for the given values
densities = kd.estimate(samplePoints)
produces:
17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at 
KernelDensity.scala:92
17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at 
KernelDensity.scala:92) with 1 output partitions
17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate at 
KernelDensity.scala:92)
17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List()
17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List()
17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 
(MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which has 
no missing parents
17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in 
memory (estimated size 6.6 KB, free 413.6 MB)
17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes 
in memory (estimated size 3.6 KB, free 413.6 MB)
17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 
192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB)
17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at 
DAGScheduler.scala:996
17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at 
PythonMLLibAPI.scala:1345)
17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks
17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 24, 
localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes)
17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24)
17/05/18 15:40:37 INFO PythonRunner: Times: total = 66, boot = -1831, init = 
1844, finish = 53
17/05/18 15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). 2476 
bytes result sent to driver
17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at 
KernelDensity.scala:92) finished in 1.001 s
17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 24) 
in 1004 ms on localhost (executor driver) (1/1)
17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks 
have all completed, from pool 
17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at 
KernelDensity.scala:92, took 1.136263 s
17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 
192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB)
5.6654703477e-05,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001
,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,

But if Data IS normally distributed:

I see:
17/05/18 15:50:16 ERROR Executor: Exception in task 0.0 in stage 24.0 (TID 24)
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for numpy.dtype)

In Scala, the correct result is:
Code:
vecRDD = sc.parallelize(colVec)
kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)

// Find density estimates for the given values
densities = kd.estimate(samplePoints)

[0.04113814235801906,1.0994865517293571E-163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
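
For completeness, a self-contained Scala version of the working call above (a sketch: the synthetic sample stands in for the reporter's colVec and samplePoints):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.KernelDensity

object KernelDensitySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "SPARK-20803-sketch")
    // Synthetic, roughly normal sample standing in for colVec.
    val colVec = Array.fill(1000)(scala.util.Random.nextGaussian())
    val vecRDD = sc.parallelize(colVec)
    val samplePoints = Array(-1.0, 0.0, 1.0)
    val kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)
    // Find density estimates for the given values.
    val densities = kd.estimate(samplePoints)
    println(densities.mkString(","))
    sc.stop()
  }
}
{code}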

[jira] [Created] (SPARK-20804) Join with null safe equality fails with AnalysisException

2017-05-18 Thread koert kuipers (JIRA)
koert kuipers created SPARK-20804:
-

 Summary: Join with null safe equality fails with AnalysisException
 Key: SPARK-20804
 URL: https://issues.apache.org/jira/browse/SPARK-20804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: org.apache.spark#spark-sql_2.11;2.3.0-SNAPSHOT from asf 
snapshots, Mon May 15 08:09:18 EDT 2017
Reporter: koert kuipers
Priority: Minor


{noformat}
val x = Seq(("a", 1), ("a", 2), (null, 1)).toDF("k", "v")
val sums = x.groupBy($"k").agg(sum($"v") as "sum")
x
  .join(sums, x("k") <=> sums("k"))
  .drop(sums("k"))
  .show
{noformat}
gives:
{noformat}
  org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER 
join between logical plans
Project [_2#54 AS v#57]
+- LocalRelation [_1#53, _2#54]
and
Aggregate [k#69], [k#69, sum(cast(v#70 as bigint)) AS sum#65L]
+- Project [_1#53 AS k#69, _2#54 AS v#70]
   +- LocalRelation [_1#53, _2#54]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
  at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1081)
  at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1078)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1078)
  at 
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1063)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
  at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
  at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:79)
  at 
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:79)
  at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:85)
  at 
org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:81)
  at 
org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
  at 
org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2901)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2238)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2451)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:680)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:639)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:648)
{noformat}
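
A self-contained version of the snippet above, for anyone who wants to rerun it outside the shell (a sketch; the shell normally provides `spark` and the implicits already):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object NullSafeJoinRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-20804").getOrCreate()
    import spark.implicits._
    val x = Seq(("a", 1), ("a", 2), (null, 1)).toDF("k", "v")
    val sums = x.groupBy($"k").agg(sum($"v") as "sum")
    // Fails with the AnalysisException above on the reported snapshot: the <=>
    // condition is optimized away and the plan is flagged as a cartesian product.
    x.join(sums, x("k") <=> sums("k")).drop(sums("k")).show()
    spark.stop()
  }
}
{code}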


[jira] [Assigned] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20801:


Assignee: Apache Spark

> Store accurate size of blocks in MapStatus when it's above threshold.
> -
>
> Key: SPARK-20801
> URL: https://issues.apache.org/jira/browse/SPARK-20801
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: Apache Spark
>
> Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus 
> is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the 
> average size is stored for non-empty blocks, which is not good for memory 
> control when we shuffle blocks. It makes sense to store the accurate size of a 
> block when it is above a threshold.
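
A minimal Scala sketch of the idea, assuming a simple threshold rule (this is not Spark's MapStatus code; names and types are illustrative):

{code}
// Keep the average for typical blocks, but record blocks above a threshold exactly.
case class SketchedSizes(avgSize: Long, huge: Map[Int, Long]) {
  def sizeOf(reduceId: Int): Long = huge.getOrElse(reduceId, avgSize)
}

object SketchedSizes {
  def fromSizes(sizes: Array[Long], threshold: Long): SketchedSizes = {
    val nonEmpty = sizes.filter(_ > 0)
    val avg = if (nonEmpty.isEmpty) 0L else nonEmpty.sum / nonEmpty.length
    val huge = sizes.zipWithIndex.collect { case (s, i) if s > threshold => i -> s }.toMap
    SketchedSizes(avg, huge)
  }
}

// Example: a ~2 GB outlier no longer hides behind a small average:
// SketchedSizes.fromSizes(Array(1L << 20, 1L << 20, 2L << 30), threshold = 1L << 28).sizeOf(2)
// returns 2L << 30 instead of the average.
{code}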






[jira] [Updated] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.

2017-05-18 Thread jin xing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jin xing updated SPARK-20801:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-19659

> Store accurate size of blocks in MapStatus when it's above threshold.
> -
>
> Key: SPARK-20801
> URL: https://issues.apache.org/jira/browse/SPARK-20801
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus 
> is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the 
> average size is stored for non-empty blocks, which is not good for memory 
> control when we shuffle blocks. It makes sense to store the accurate size of a 
> block when it is above a threshold.






[jira] [Assigned] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20801:


Assignee: (was: Apache Spark)

> Store accurate size of blocks in MapStatus when it's above threshold.
> -
>
> Key: SPARK-20801
> URL: https://issues.apache.org/jira/browse/SPARK-20801
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus 
> is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the 
> average size is stored for non-empty blocks, which is not good for memory 
> control when we shuffle blocks. It makes sense to store the accurate size of a 
> block when it is above a threshold.






[jira] [Commented] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.

2017-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016813#comment-16016813
 ] 

Apache Spark commented on SPARK-20801:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/18031

> Store accurate size of blocks in MapStatus when it's above threshold.
> -
>
> Key: SPARK-20801
> URL: https://issues.apache.org/jira/browse/SPARK-20801
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>
> Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus 
> is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the 
> average size is stored for non-empty blocks, which is not good for memory 
> control when we shuffle blocks. It makes sense to store the accurate size of a 
> block when it is above a threshold.






[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-18 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016664#comment-16016664
 ] 

Ruslan Dautkhanov commented on SPARK-18838:
---

My 2 cents: it would be nice to explore the idea of storing offsets of consumed 
messages in the listeners themselves, very much like Kafka consumers do 
(based on my limited knowledge of Spark event queue listeners, and assuming the 
listeners don't depend on each other and can read from the queue 
asynchronously). Then, if one of the "non-critical" listeners can't keep up, 
messages would be lost just for that one listener, and it wouldn't affect the 
rest of the listeners.
{quote}Alternatively, we could use two queues, one for internal listeners and 
another for external ones{quote}
Making a parallel with Kafka again, it looks like we're talking here about two 
"topics".

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job. For example, a significant delay in 
> receiving a `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for each executor service will be bounded, to 
> guarantee that we do not grow memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint.
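
A hedged Scala sketch of the per-listener dispatch described above (not Spark's LiveListenerBus; the listener trait and sizes are simplified assumptions): each listener gets its own single-threaded executor backed by a bounded queue, so a slow listener only delays, or drops, its own events.

{code}
import java.util.concurrent.{LinkedBlockingQueue, RejectedExecutionException}
import java.util.concurrent.{ThreadPoolExecutor, TimeUnit}

trait Listener { def onEvent(event: Any): Unit }

final class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 10000) {
  // One single-threaded executor per listener guarantees in-order delivery per listener.
  private val executors = listeners.map { l =>
    l -> new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue[Runnable](queueCapacity))
  }

  def post(event: Any): Unit = executors.foreach { case (l, exec) =>
    try exec.execute(new Runnable { def run(): Unit = l.onEvent(event) })
    catch {
      // Bounded queue is full: the event is dropped for this listener only.
      case _: RejectedExecutionException => ()
    }
  }

  def stop(): Unit = executors.foreach(_._2.shutdown())
}
{code}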






[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016663#comment-16016663
 ] 

Josh Rosen commented on SPARK-18838:


[~bOOmX], do you have CPU-time profiling within each of those listeners? I'm 
wondering why StorageListener is so slow. Although it's not a real solution, I 
bet that we could make a significant constant-factor improvement to 
StorageListener (perhaps by using more imperative Java-style code instead of 
Scala collections).

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job. For example, a significant delay in 
> receiving a `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for each executor service will be bounded, to 
> guarantee that we do not grow memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint.






[jira] [Created] (SPARK-20803) KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not

2017-05-18 Thread Bettadapura Srinath Sharma (JIRA)
Bettadapura Srinath Sharma created SPARK-20803:
--

 Summary: KernelDensity.estimate in 
pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException 
when input data is normally distributed (no error when data is not normally 
distributed)
 Key: SPARK-20803
 URL: https://issues.apache.org/jira/browse/SPARK-20803
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.1.1
 Environment: Linux version 4.4.14-smp
x86/fpu: Legacy x87 FPU detected.
using command line: 
bash-4.3$ ./bin/spark-submit ~/work/python/Features.py
bash-4.3$ pwd
/home/bsrsharma/spark-2.1.1-bin-hadoop2.7
export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121
Reporter: Bettadapura Srinath Sharma


When data is NOT normally distributed (correct behavior):
This code:
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
# Find density estimates for the given values
densities = kd.estimate(samplePoints)
produces:
17/05/18 15:40:36 INFO SparkContext: Starting job: aggregate at 
KernelDensity.scala:92
17/05/18 15:40:36 INFO DAGScheduler: Got job 21 (aggregate at 
KernelDensity.scala:92) with 1 output partitions
17/05/18 15:40:36 INFO DAGScheduler: Final stage: ResultStage 24 (aggregate at 
KernelDensity.scala:92)
17/05/18 15:40:36 INFO DAGScheduler: Parents of final stage: List()
17/05/18 15:40:36 INFO DAGScheduler: Missing parents: List()
17/05/18 15:40:36 INFO DAGScheduler: Submitting ResultStage 24 
(MapPartitionsRDD[44] at mapPartitions at PythonMLLibAPI.scala:1345), which has 
no missing parents
17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25 stored as values in 
memory (estimated size 6.6 KB, free 413.6 MB)
17/05/18 15:40:36 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes 
in memory (estimated size 3.6 KB, free 413.6 MB)
17/05/18 15:40:36 INFO BlockManagerInfo: Added broadcast_25_piece0 in memory on 
192.168.0.115:38645 (size: 3.6 KB, free: 413.9 MB)
17/05/18 15:40:36 INFO SparkContext: Created broadcast 25 from broadcast at 
DAGScheduler.scala:996
17/05/18 15:40:36 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 24 (MapPartitionsRDD[44] at mapPartitions at 
PythonMLLibAPI.scala:1345)
17/05/18 15:40:36 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks
17/05/18 15:40:36 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 24, 
localhost, executor driver, partition 0, PROCESS_LOCAL, 96186 bytes)
17/05/18 15:40:36 INFO Executor: Running task 0.0 in stage 24.0 (TID 24)
17/05/18 15:40:37 INFO PythonRunner: Times: total = 66, boot = -1831, init = 
1844, finish = 53
17/05/18 15:40:37 INFO Executor: Finished task 0.0 in stage 24.0 (TID 24). 2476 
bytes result sent to driver
17/05/18 15:40:37 INFO DAGScheduler: ResultStage 24 (aggregate at 
KernelDensity.scala:92) finished in 1.001 s
17/05/18 15:40:37 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 24) 
in 1004 ms on localhost (executor driver) (1/1)
17/05/18 15:40:37 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks 
have all completed, from pool 
17/05/18 15:40:37 INFO DAGScheduler: Job 21 finished: aggregate at 
KernelDensity.scala:92, took 1.136263 s
17/05/18 15:40:37 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 
192.168.0.115:38645 in memory (size: 3.6 KB, free: 413.9 MB)
5.6654703477e-05,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001
,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,
0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,0.000100010001,

But if Data IS normally distributed:

I see:
17/05/18 15:50:16 ERROR Executor: Exception in task 0.0 in stage 24.0 (TID 24)
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for numpy.dtype)

[jira] [Commented] (SPARK-20784) Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() in YARN client mode

2017-05-18 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016600#comment-16016600
 ] 

Hyukjin Kwon commented on SPARK-20784:
--

Thank you for checking this out.

> Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() 
> in YARN client mode
> -
>
> Key: SPARK-20784
> URL: https://issues.apache.org/jira/browse/SPARK-20784
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Mathieu D
>
> Spark hangs and stops executing any job or task (v2.0.2).
> The Web UI shows *0 active stages* and *0 active tasks* on executors, although 
> a driver thread is clearly working on/finishing a stage (see below).
> Our application runs several Spark contexts for several users in parallel in 
> threads. Spark version 2.0.2, yarn-client.
> Extract of thread stack below.
> {noformat}
> "ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x7fddf0005800 
> nid=0x484 waiting on condition [0x7fddd0bf
> 6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00078c232760> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at 
> org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212)
> at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
> at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
> at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
> at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at 
> org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
> at 
> org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scal

[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-18 Thread Antoine PRANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016579#comment-16016579
 ] 

Antoine PRANG commented on SPARK-18838:
---

[~joshrosen][~sitalke...@gmail.com] I have measured the listener execution time 
on my test case.
Below you can find, for each listener, the average percentage of the total 
execution time spent per message (I instrumented ListenerBus):
EnvironmentListener       : 0.0
EventLoggingListener      : 48.2
ExecutorsListener         : 0.1
HeartbeatReceiver         : 0.0
JobProgressListener       : 6.9
PepperdataSparkListener   : 2.7
RDDOperationGraphListener : 0.1
StorageListener           : 38.3
StorageStatusListener     : 0.4
The execution time is concentrated in two listeners: EventLoggingListener and 
StorageListener.
I think that putting parallelization at the listener bus level is not such a 
good idea. Duplicating the messages into two queues would change the current 
synchronization contract (all listeners receive each message at about the same 
time; a listener is ahead of or behind the others by at most one message).
For me the best approach would be to keep the listener bus as simple as it is 
now (N producers, 1 consumer), so it can dequeue as fast as possible, and to 
introduce parallelism at the listener level where possible, while respecting 
the synchronization contract. The EventLoggingListener, for example, can be 
executed asynchronously.
I am preparing a commit to do that right now.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delays in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job. For example, a significant delay in 
> receiving a `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for each executor service will be bounded, to 
> guarantee that we do not grow memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint.






[jira] [Commented] (SPARK-19076) Upgrade Hive dependence to Hive 2.x

2017-05-18 Thread William Handy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016309#comment-16016309
 ] 

William Handy commented on SPARK-19076:
---

It seems like it was decided that this was too difficult, but I wanted to point 
out that Hive 2.1 has multithreaded writes, controlled by the settings 
hive.mv.files.thread and hive.metastore.fshandler.threads. If you happen to be 
using Spark on S3, these settings would be a significant performance boost.

There are several articles talking about using these settings in the context of 
"Hive on Spark", whereas I want to see them in "Hive _in_ Spark" instead :-/

> Upgrade Hive dependence to Hive 2.x
> ---
>
> Key: SPARK-19076
> URL: https://issues.apache.org/jira/browse/SPARK-19076
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dapeng Sun
>
> Currently upstream Spark depends on Hive 1.2.1 to build its packages. Hive 2.0 
> was released in February 2016, and Hive 2.0.1 and 2.1.0 have also been out for 
> a long time, so on the Spark side it would be better to support Hive 2.0 and 
> above.






[jira] [Commented] (SPARK-20389) Upgrade kryo to fix NegativeArraySizeException

2017-05-18 Thread Louis Bergelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016305#comment-16016305
 ] 

Louis Bergelson commented on SPARK-20389:
-

What's the process for evaluating the effect on Spark apps? As far as I can 
tell, the changes from Kryo 3 to 4 mean that data serialized with an older 
version of Kryo will not be loadable by a newer version unless you run with a 
compatibility option configured. Hopefully most apps aren't storing data that 
has been serialized by Kryo and are only using it for serialization between 
processes. I don't know if this has knock-on effects for things like Parquet, 
though; does it use Kryo?

We have a recurring issue when serializing large objects that is fixed in Kryo 
4 and would really like to see Spark updated.

> Upgrade kryo to fix NegativeArraySizeException
> --
>
> Key: SPARK-20389
> URL: https://issues.apache.org/jira/browse/SPARK-20389
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.1.0
> Environment: Linux, Centos7, jdk8
>Reporter: Georg Heiler
>
> I am experiencing an issue with Kryo when writing Parquet files. Similar to 
> https://github.com/broadinstitute/gatk/issues/1524, a 
> NegativeArraySizeException occurs. Apparently this is fixed in a current Kryo 
> version, but Spark is still using the very old Kryo 3.3.
> Can you please upgrade to a fixed Kryo version?






[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

2017-05-18 Thread kobefeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kobefeng updated SPARK-20663:
-
Affects Version/s: (was: 2.1.0)
   2.1.1

> Data missing after insert overwrite table partition which is created on 
> specific location
> -
>
> Key: SPARK-20663
> URL: https://issues.apache.org/jira/browse/SPARK-20663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: kobefeng
>
> Use Spark SQL to create a partitioned table first, alter the table by adding a 
> partition at a specific location, and then insert overwrite into this 
> partition from a select; this causes data to go missing, unlike in Hive.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
>  > id bigint,
>  > name string,
>  > dt string
>  > ) using parquet options ('compression'='snappy', 
> 'path'='/user/kofeng/partitioned_table')
>  > partitioned by (dt);
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x   - kofeng kofeng  0 2017-05-08 17:00 
> /user/kofeng/partitioned_table/20170507
> -- insert overwrite this partition: the specific location folder is gone and 
> data is missing, but the job succeeds and writes _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r--   3 kofeng kofeng 0 2017-05-08 17:06 
> /user/kofeng/partitioned_table/_SUCCESS
> 
> 
> -- Then drop this partition and use hive to add partition and insert 
> overwrite this partition data, then verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists 
> partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- could see hive insert overwrite that specific location successfully.
> hive> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
> hive> select * from kofeng.partitioned_table;
> OK
> 123   kofeng  20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/20170507
> -rwxr-xr-x   3 kofeng kofeng   338 2017-05-08 17:26 
> /user/kofeng/partitioned_table/20170507/00_0
> {code}






[jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location

2017-05-18 Thread kobefeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kobefeng updated SPARK-20663:
-
Labels:   (was: easyfix)

> Data missing after insert overwrite table partition which is created on 
> specific location
> -
>
> Key: SPARK-20663
> URL: https://issues.apache.org/jira/browse/SPARK-20663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: kobefeng
>
> Use Spark SQL to create a partitioned table first, alter the table by adding a 
> partition at a specific location, and then insert overwrite into this 
> partition from a select; this causes data to go missing, unlike in Hive.
> {code:title=partition_table_insert_overwrite.sql|borderStyle=solid}
> -- create partition table first
> $ hadoop fs -mkdir /user/kofeng/partitioned_table
> $ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
> spark-sql> create table kofeng.partitioned_table(
>  > id bigint,
>  > name string,
>  > dt string
>  > ) using parquet options ('compression'='snappy', 
> 'path'='/user/kofeng/partitioned_table')
>  > partitioned by (dt);
> -- add partition with specific location
> spark-sql> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> $ hadoop fs -ls /user/kofeng/partitioned_table
> drwxr-xr-x   - kofeng kofeng  0 2017-05-08 17:00 
> /user/kofeng/partitioned_table/20170507
> -- insert overwrite this partition: the specific location folder is gone and 
> data is missing, but the job succeeds and writes _SUCCESS
> spark-sql> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name;
> $ hadoop fs -ls /user/kofeng/partitioned_table
> -rw-r--r--   3 kofeng kofeng 0 2017-05-08 17:06 
> /user/kofeng/partitioned_table/_SUCCESS
> 
> 
> -- Then drop this partition and use hive to add partition and insert 
> overwrite this partition data, then verify:
> spark-sql> alter table kofeng.partitioned_table drop if exists 
> partition(dt='20170507');
> hive> alter table kofeng.partitioned_table add if not exists 
> partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
> OK
> -- could see hive insert overwrite that specific location successfully.
> hive> insert overwrite table kofeng.partitioned_table 
> partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
> hive> select * from kofeng.partitioned_table;
> OK
> 123   kofeng  20170507
> $ hadoop fs -ls /user/kofeng/partitioned_table/20170507
> -rwxr-xr-x   3 kofeng kofeng   338 2017-05-08 17:26 
> /user/kofeng/partitioned_table/20170507/00_0
> {code}






[jira] [Commented] (SPARK-16627) --jars doesn't work in Mesos mode

2017-05-18 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016253#comment-16016253
 ] 

Michael Gummelt commented on SPARK-16627:
-

I'm not completely sure, but I believe that the dispatcher is correctly setting 
{{spark.jars}}, but due to SPARK-10643, the driver is not recognizing the 
remote jar URL.

> --jars doesn't work in Mesos mode
> -
>
> Key: SPARK-16627
> URL: https://issues.apache.org/jira/browse/SPARK-16627
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Reporter: Michael Gummelt
>
> Definitely doesn't work in cluster mode.  Might not work in client mode 
> either.






[jira] [Created] (SPARK-20802) kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not nor

2017-05-18 Thread Bettadapura Srinath Sharma (JIRA)
Bettadapura Srinath Sharma created SPARK-20802:
--

 Summary: kolmogorovSmirnovTest in pyspark.mllib.stat.Statistics 
throws net.razorvine.pickle.PickleException when input data is normally 
distributed (no error when data is not normally distributed)
 Key: SPARK-20802
 URL: https://issues.apache.org/jira/browse/SPARK-20802
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.1.1
 Environment: Linux version 4.4.14-smp
x86/fpu: Legacy x87 FPU detected.
using command line: 
bash-4.3$ ./bin/spark-submit ~/work/python/Features.py
bash-4.3$ pwd
/home/bsrsharma/spark-2.1.1-bin-hadoop2.7
export JAVA_HOME=/home/bsrsharma/jdk1.8.0_121

Reporter: Bettadapura Srinath Sharma


In Scala (correct behavior), the code:
testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", means(j), 
stdDev(j))
produces:
17/05/18 10:52:53 INFO FeatureLogger: Kolmogorov-Smirnov test summary:
degrees of freedom = 0 
statistic = 0.005495681749849268 
pValue = 0.9216108887428276 
No presumption against null hypothesis: Sample follows theoretical distribution.

In Python (incorrect behavior), the code:
testResult = Statistics.kolmogorovSmirnovTest(vecRDD, 'norm', numericMean[j], 
numericSD[j])

causes this error:
17/05/17 21:59:23 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
net.razorvine.pickle.PickleException: expected zero arguments for construction 
of ClassDict (for numpy.dtype)
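
For reference, a self-contained Scala version of the call that works (a sketch: the Gaussian sample and the mean/stdDev values stand in for the reporter's data):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.stat.Statistics

object KsTestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "SPARK-20802-sketch")
    // Synthetic normally distributed sample standing in for the reporter's column.
    val vecRDD = sc.parallelize(Array.fill(1000)(scala.util.Random.nextGaussian()))
    val (mean, stdDev) = (0.0, 1.0)
    val testResult = Statistics.kolmogorovSmirnovTest(vecRDD, "norm", mean, stdDev)
    println(testResult)  // prints the Kolmogorov-Smirnov test summary shown above
    sc.stop()
  }
}
{code}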
 








[jira] [Commented] (SPARK-12559) Cluster mode doesn't work with --packages

2017-05-18 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016243#comment-16016243
 ] 

Michael Gummelt commented on SPARK-12559:
-

I changed the title from "Standalone cluster mode" to "cluster mode", since 
--packages doesn't work with any cluster mode.

> Cluster mode doesn't work with --packages
> -
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> From the mailing list:
> {quote}
> Another problem I ran into, and that you also might, is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> that to fix this we should either (1) upload the jars, or (2) just run the 
> packages code on the driver side. I slightly prefer (2) because it's simpler.






[jira] [Updated] (SPARK-12559) Cluster mode doesn't work with --packages

2017-05-18 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-12559:

Summary: Cluster mode doesn't work with --packages  (was: Standalone 
cluster mode doesn't work with --packages)

> Cluster mode doesn't work with --packages
> -
>
> Key: SPARK-12559
> URL: https://issues.apache.org/jira/browse/SPARK-12559
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Andrew Or
>
> From the mailing list:
> {quote}
> Another problem I ran into, and that you also might, is that --packages doesn't
> work with --deploy-mode cluster.  It downloads the packages to a temporary
> location on the node running spark-submit, then passes those paths to the
> node that is running the Driver, but since that isn't the same machine, it
> can't find anything and fails.  The driver process *should* be the one
> doing the downloading, but it isn't. I ended up having to create a fat JAR
> with all of the dependencies to get around that one.
> {quote}
> The problem is that we currently don't upload jars to the cluster. It seems 
> that to fix this we should either (1) upload the jars, or (2) just run the 
> packages code on the driver side. I slightly prefer (2) because it's simpler.






[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-05-18 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016217#comment-16016217
 ] 

Sital Kedia commented on SPARK-18838:
-

[~joshrosen] - >> Alternatively, we could use two queues, one for internal 
listeners and another for external ones. This wouldn't be as fine-grained as 
thread-per-listener but might buy us a lot of the benefits with perhaps less 
code needed.

Actually that is exactly what my PR is doing. 
https://github.com/apache/spark/pull/16291. I have not been able to work on it 
recently, but you can take a look and let me know how it looks. I can 
prioritize working on it. 

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
> `ListenerBus` events, and this delay might hurt job performance significantly 
> or even fail the job. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly 
> remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow one. In addition, a slow listener can cause 
> events to be dropped from the event queue, which might be fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of events per 
> listener. The queue used for each executor service will be bounded so that 
> memory does not grow indefinitely. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint.
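
A minimal Scala sketch of the per-listener executor idea described above; the 
class name PerListenerBus, the queue capacity, and the onOtherEvent-only 
dispatch are illustrative assumptions, not the actual proposal's code.

{code}
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// One bounded, single-threaded executor per listener, so a slow listener only
// delays itself and the bounded queue keeps driver memory in check.
class PerListenerBus(listeners: Seq[SparkListener], queueCapacity: Int = 10000) {

  private val executors: Seq[(SparkListener, ThreadPoolExecutor)] = listeners.map { l =>
    l -> new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new LinkedBlockingQueue[Runnable](queueCapacity))
  }

  def post(event: SparkListenerEvent): Unit = executors.foreach { case (listener, executor) =>
    // the single worker thread per listener guarantees in-order processing;
    // a full queue makes execute() reject, i.e. the event is dropped for that listener
    executor.execute(new Runnable {
      override def run(): Unit = listener.onOtherEvent(event) // real dispatch is per event type
    })
  }
}
{code}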



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20776) Fix JobProgressListener perf. problems caused by empty TaskMetrics initialization

2017-05-18 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016168#comment-16016168
 ] 

Ruslan Dautkhanov commented on SPARK-20776:
---

Thank you [~joshrosen]. Would it be possible to backport this patch to Spark 
2.1 as well?

> Fix JobProgressListener perf. problems caused by empty TaskMetrics 
> initialization
> -
>
> Key: SPARK-20776
> URL: https://issues.apache.org/jira/browse/SPARK-20776
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.2.0
>
>
> In 
> {code}
> ./bin/spark-shell --master=local[64]
> {code}
> I ran 
> {code}
>   sc.parallelize(1 to 10, 10).count()
> {code}
> and profiled the time spent in the LiveListenerBus event processing thread. I 
> discovered that the majority of the time was being spent constructing empty 
> TaskMetrics instances inside JobProgressListener. As I'll show in a PR, we 
> can slightly simplify the code to remove the need to construct one empty 
> TaskMetrics per onTaskSubmitted event.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results

2017-05-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20364.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Parquet predicate pushdown on columns with dots return empty results
> 
>
> Key: SPARK-20364
> URL: https://issues.apache.org/jira/browse/SPARK-20364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.2.0
>
>
> Currently, if there are dots in the column name, predicate pushdown seems 
> to fail in Parquet.
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> ++
> |col.dots|
> ++
> ++
> {code}
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +---+
> |coldots|
> +---+
> |  1|
> +---+
> {code}
> It seems that a dot in a column name passed to {{FilterApi}} is treated as a 
> field separator ({{ColumnPath}} with multiple path elements), whereas the 
> actual column name is {{col.dots}}. (See [FilterApi.java#L71 
> |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71],
>  which calls 
> [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
> I tried to come up with ways to resolve this and found two:
> One is simply not to push down filters when there are dots in the column 
> names, so that Parquet reads everything and the filtering happens on the 
> Spark side.
> The other is to create Spark's own version of {{FilterApi}} for those columns 
> (the Parquet one appears to be final) so that a single column path is always 
> used on the Spark side (this seems hacky), given that we do not push down 
> nested columns currently. That way we can build the field name via 
> {{ColumnPath.get}} rather than {{ColumnPath.fromDotString}}.
> I just made a rough version of the latter. 
> {code}
> private[parquet] object ParquetColumns {
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
> new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with 
> SupportsLtGt
>   }
>   def longColumn(columnPath: String): Column[java.lang.Long] with 
> SupportsLtGt = {
> new Column[java.lang.Long] (
>   ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>   def floatColumn(columnPath: String): Column[java.lang.Float] with 
> SupportsLtGt = {
> new Column[java.lang.Float] (
>   ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with 
> SupportsLtGt = {
> new Column[java.lang.Double] (
>   ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with 
> SupportsEqNotEq = {
> new Column[java.lang.Boolean] (
>   ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with 
> SupportsEqNotEq
>   }
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
> new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with 
> SupportsLtGt
>   }
> }
> {code}
> Neither sounds ideal. Please let me know if anyone has a better idea.
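
A small sketch illustrating the parquet-mr behaviour described above (not Spark 
code); it only contrasts the two {{ColumnPath}} factory methods mentioned.

{code}
import org.apache.parquet.hadoop.metadata.ColumnPath

// fromDotString splits on the dot, giving two path elements: ["col", "dots"]
val splitOnDot = ColumnPath.fromDotString("col.dots")

// get treats the whole string as a single path element: ["col.dots"]
val singleField = ColumnPath.get("col.dots")
{code}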



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results

2017-05-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20364:
---

Assignee: Hyukjin Kwon

> Parquet predicate pushdown on columns with dots return empty results
> 
>
> Key: SPARK-20364
> URL: https://issues.apache.org/jira/browse/SPARK-20364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.2.0
>
>
> Currently, if there are dots in the column name, predicate pushdown seems 
> to fail in Parquet.
> **With dots**
> {code}
> val path = "/tmp/abcde"
> Seq(Some(1), None).toDF("col.dots").write.parquet(path)
> spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
> {code}
> {code}
> ++
> |col.dots|
> ++
> ++
> {code}
> **Without dots**
> {code}
> val path = "/tmp/abcde2"
> Seq(Some(1), None).toDF("coldots").write.parquet(path)
> spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
> {code}
> {code}
> +---+
> |coldots|
> +---+
> |  1|
> +---+
> {code}
> It seems that a dot in a column name passed to {{FilterApi}} is treated as a 
> field separator ({{ColumnPath}} with multiple path elements), whereas the 
> actual column name is {{col.dots}}. (See [FilterApi.java#L71 
> |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71],
>  which calls 
> [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44].)
> I tried to come up with ways to resolve this and found two:
> One is simply not to push down filters when there are dots in the column 
> names, so that Parquet reads everything and the filtering happens on the 
> Spark side.
> The other is to create Spark's own version of {{FilterApi}} for those columns 
> (the Parquet one appears to be final) so that a single column path is always 
> used on the Spark side (this seems hacky), given that we do not push down 
> nested columns currently. That way we can build the field name via 
> {{ColumnPath.get}} rather than {{ColumnPath.fromDotString}}.
> I just made a rough version of the latter. 
> {code}
> private[parquet] object ParquetColumns {
>   def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = {
> new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with 
> SupportsLtGt
>   }
>   def longColumn(columnPath: String): Column[java.lang.Long] with 
> SupportsLtGt = {
> new Column[java.lang.Long] (
>   ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt
>   }
>   def floatColumn(columnPath: String): Column[java.lang.Float] with 
> SupportsLtGt = {
> new Column[java.lang.Float] (
>   ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt
>   }
>   def doubleColumn(columnPath: String): Column[java.lang.Double] with 
> SupportsLtGt = {
> new Column[java.lang.Double] (
>   ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt
>   }
>   def booleanColumn(columnPath: String): Column[java.lang.Boolean] with 
> SupportsEqNotEq = {
> new Column[java.lang.Boolean] (
>   ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with 
> SupportsEqNotEq
>   }
>   def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = {
> new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with 
> SupportsLtGt
>   }
> }
> {code}
> Neither sounds ideal. Please let me know if anyone has a better idea.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016116#comment-16016116
 ] 

yuhao yang commented on SPARK-20768:


Thanks for the ping, [~mlnick]. We should just treat it as an expert param. My 
impression is that in Python it should normally still be exposed as a Param.

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})
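
For contrast, a hedged sketch of the Scala-side usage; the "items" column name 
and the parameter values are illustrative, and the point of this issue is that 
the numPartitions expert param has no PySpark counterpart yet.

{code}
import org.apache.spark.ml.fpm.FPGrowth

val fpm = new FPGrowth()
  .setItemsCol("items")      // illustrative column name
  .setMinSupport(0.3)
  .setNumPartitions(8)       // the expert param in question
{code}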



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016061#comment-16016061
 ] 

yuhao yang commented on SPARK-20797:


[~d0evi1] Thanks for reporting the issue and proposing a fix. Would you send a 
PR for it?

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model with large text data (nearly 1 billion 
> Chinese news abstracts), the training step goes well, but the save step 
> fails. Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized model is bigger than spark.kryoserializer.buffer.max 
> (raising that param fixes problem 1, but leads to problem 2).
> Problem 2: it exceeds spark.akka.frameSize (raising this param too far fails 
> with out of memory and the job gets killed; on versions > 2.0.0 the error is 
> "exceeds max allowed: spark.rpc.message.maxSize").
> This appears when the topic count is large (k=200 is OK, but k=300 fails) and 
> the vocabulary size is also large (nearly 1,000,000).
> I found that Word2Vec's save function is similar to LocalLDAModel's save 
> function:
> Word2Vec's problem (saving with repartition(1)) has been fixed in 
> [https://github.com/apache/spark/pull/9989], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition when saving.
> Word2Vec's save method from the latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save 
> method, recompiled the code, and deployed the new LDA job with large data on 
> our cluster; it works.
> I hope this will be fixed in the next version.
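
A hedged sketch of the fix described above, mirroring the Word2Vec save; 
topicsMatrix, topics, Loader and path refer to the snippets quoted in this 
description, and bufferSize is a stand-in value, so this is an illustration 
rather than the actual patch.

{code}
val bufferSize = 64L * 1024 * 1024                 // stand-in for spark.kryoserializer.buffer.max, in bytes
val vocabSize = topicsMatrix.numRows
val k = topicsMatrix.numCols
val approxSize = 8L * vocabSize * k                // dense matrix of Doubles, rough estimate
val nPartitions = ((approxSize / bufferSize) + 1).toInt

spark.createDataFrame(topics)
  .repartition(nPartitions)                        // instead of repartition(1)
  .write.parquet(Loader.dataPath(path))
{code}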



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20796:
-

Assignee: liuzhaokun

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: liuzhaokun
>Priority: Trivial
> Fix For: 2.1.2, 2.2.0
>
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20796.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 18027
[https://github.com/apache/spark/pull/18027]

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
> Fix For: 2.2.0, 2.1.2
>
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20779) The ASF header placed in an incorrect location in some files

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20779.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18012
[https://github.com/apache/spark/pull/18012]

> The ASF header placed in an incorrect location in some files
> 
>
> Key: SPARK-20779
> URL: https://issues.apache.org/jira/browse/SPARK-20779
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.1.0, 2.1.1
>Reporter: zuotingbing
>Priority: Trivial
> Fix For: 2.3.0
>
>
> When I tested some examples, I found that the license header is not at the 
> top of some files. It would be best to update these files so that the 
> placement of the ASF header is consistent with the other files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20779) The ASF header placed in an incorrect location in some files

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20779:
-

Assignee: zuotingbing

> The ASF header placed in an incorrect location in some files
> 
>
> Key: SPARK-20779
> URL: https://issues.apache.org/jira/browse/SPARK-20779
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.1.0, 2.1.1
>Reporter: zuotingbing
>Assignee: zuotingbing
>Priority: Trivial
> Fix For: 2.3.0
>
>
> When I tested some examples, I found that the license header is not at the 
> top of some files. It would be best to update these files so that the 
> placement of the ASF header is consistent with the other files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016021#comment-16016021
 ] 

Mitesh commented on SPARK-17867:


Ah I see, thanks [~viirya]. The repartitionByColumns is just a short-cut method 
I created. But I do have some aliasing code changes compared to 2.1; I will try 
to remove those and see if that is what's breaking it.

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from output with the given 
> column name in Dataset.dropDuplicates. When we have the more than one columns 
> with the same name. Other columns are put into aggregation columns, instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2017-05-18 Thread Brian Albright (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015964#comment-16015964
 ] 

Brian Albright commented on SPARK-14492:


Ran into this over the past week.  Here are some gists to illustrate the issue 
as I see it.

pom.xml: https://gist.github.com/MisterSpicy/3ae3205ba09668b8e9b9fc2d1655646e
the test: https://gist.github.com/MisterSpicy/20ef8cc2777dc82e305f29a507ae83d7


> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it's not possible to use 
> a Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
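
For reference, a hedged sketch (using the 2.x SparkSession API for brevity) of 
the usual way to point Spark SQL at an older external metastore; the property 
names are real Spark settings, but per this report, metastore versions below 
1.2.0 still fail with the error shown above even through these settings.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("old-metastore-demo")
  .enableHiveSupport()
  .config("spark.sql.hive.metastore.version", "0.14.0")  // illustrative older version
  .config("spark.sql.hive.metastore.jars", "maven")      // or a classpath of matching Hive jars
  .getOrCreate()
{code}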



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20801) Store accurate size of blocks in MapStatus when it's above threshold.

2017-05-18 Thread jin xing (JIRA)
jin xing created SPARK-20801:


 Summary: Store accurate size of blocks in MapStatus when it's 
above threshold.
 Key: SPARK-20801
 URL: https://issues.apache.org/jira/browse/SPARK-20801
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: jin xing


Currently, when the number of reducers is above 2000, HighlyCompressedMapStatus 
is used to store the sizes of blocks. In HighlyCompressedMapStatus, only the 
average size is stored for non-empty blocks, which is not good for memory 
control when we fetch shuffle blocks. It makes sense to store the accurate size 
of a block when it is above a threshold.
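
A rough sketch of the idea, not Spark's actual MapStatus code; the block sizes 
and the 2x threshold below are made-up illustrative values.

{code}
// Keep only the average for ordinary blocks, but remember the exact size of
// any block above a threshold.
val blockSizes: Array[Long] = Array(12L, 15L, 14L, 900L, 13L)   // per-reducer block sizes (made up)
val nonEmpty = blockSizes.filter(_ > 0)
val avgSize = if (nonEmpty.nonEmpty) nonEmpty.sum / nonEmpty.length else 0L
val threshold = 2 * avgSize                                     // illustrative threshold choice

val hugeBlockSizes: Map[Int, Long] = blockSizes.zipWithIndex.collect {
  case (size, reduceId) if size > threshold => reduceId -> size
}.toMap

// a reducer would then estimate a block as hugeBlockSizes.getOrElse(reduceId, avgSize)
{code}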



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread liuzhaokun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015914#comment-16015914
 ] 

liuzhaokun commented on SPARK-20796:


@Sean Owen
I am so sorry about it. And I will take your advice.

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20784) Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() in YARN client mode

2017-05-18 Thread Mathieu D (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu D closed SPARK-20784.
-
Resolution: Not A Bug

Oh boy, it was an OOM on the driver. Most of the time, it was silent. I just 
discovered an OOM exception in the middle of the logs in a task-result-getter 
thread. I guess the BroadcastExchange was just waiting for it.

> Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() 
> in YARN client mode
> -
>
> Key: SPARK-20784
> URL: https://issues.apache.org/jira/browse/SPARK-20784
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.1
>Reporter: Mathieu D
>
> Spark hangs and stops executing any job or task (v2.0.2).
> The Web UI shows *0 active stages* and *0 active tasks* on executors, although 
> a driver thread is clearly working on/finishing a stage (see below).
> Our application runs several Spark contexts for several users in parallel in 
> threads. Spark version 2.0.2, yarn-client.
> Extract of the thread stack below.
> {noformat}
> "ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x7fddf0005800 
> nid=0x484 waiting on condition [0x7fddd0bf
> 6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x00078c232760> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at 
> org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212)
> at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
> at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
> at 
> org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
> at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
> at 
> org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
> at 
> org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
> at 
> org.apache.spark.sql.execution.InputAdapt

[jira] [Commented] (SPARK-20782) Dataset's isCached operator

2017-05-18 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015872#comment-16015872
 ] 

Wenchen Fan commented on SPARK-20782:
-

One alternative is to create a temp view and cache the view; then we can use 
`spark.catalog.isCached(viewName)`.
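
A minimal sketch of this workaround; the view name and the stand-in DataFrame 
below are made up for the example.

{code}
val q2 = spark.range(10).toDF("id")        // stand-in for the query to check
q2.createOrReplaceTempView("q2_view")
spark.catalog.cacheTable("q2_view")
spark.catalog.isCached("q2_view")          // returns true once the view is cached
{code}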

> Dataset's isCached operator
> ---
>
> Key: SPARK-20782
> URL: https://issues.apache.org/jira/browse/SPARK-20782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It'd be very convenient to have {{isCached}} operator that would say whether 
> a query is cached in-memory or not.
> It'd be as simple as the following snippet:
> {code}
> // val q2: DataFrame
> spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015852#comment-16015852
 ] 

Liang-Chi Hsieh commented on SPARK-17867:
-

The above example code can't compile with the current codebase. There is no 
repartitionByColumns, only repartition.

{code}
val df = Seq((1, 2, 3, "hi"), (1, 2, 4, "hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartition($"userid")
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartition($"userid")
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")
{code}

The optimized plan looks like:

{code}
Sort [userid#9 ASC NULLS FIRST], false
+- RepartitionByExpression [userid#9], 5
   +- Filter (isnotnull(del#12) && NOT (del#12 = hi))
  +- Aggregate [eventid#10], [first(userid#9, false) AS userid#9, 
eventid#10, first(vk#11, false) AS vk#11, first(del#12, false) AS del#12]
 +- Sort [userid#9 ASC NULLS FIRST, eventid#10 ASC NULLS FIRST, vk#11 
DESC NULLS LAST], false
+- RepartitionByExpression [userid#9], 5
   +- LocalRelation [userid#9, eventid#10, vk#11, del#12]
{code}

The spark plan looks like:

{code}
Sort [userid#9 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(userid#9, 5)
   +- Filter (isnotnull(del#12) && NOT (del#12 = hi))
  +- SortAggregate(key=[eventid#10], functions=[first(userid#9, false), 
first(vk#11, false), first(del#12, false)], output=[userid#9, eventid#10, 
vk#11, del#12])
 +- SortAggregate(key=[eventid#10], functions=[partial_first(userid#9, 
false), partial_first(vk#11, false), partial_first(del#12, false)], 
output=[eventid#10, first#35, valueSet#36, first#37, valueSet#38, first#39, 
valueSet#40])
+- Sort [userid#9 ASC NULLS FIRST, eventid#10 ASC NULLS FIRST, 
vk#11 DESC NULLS LAST], false, 0
   +- Exchange hashpartitioning(userid#9, 5)
  +- LocalTableScan [userid#9, eventid#10, vk#11, del#12]
{code}

Looks like the "del <> 'hi'" filter doesn't get pushed down?

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from output with the given 
> column name in Dataset.dropDuplicates. When we have the more than one columns 
> with the same name. Other columns are put into aggregation columns, instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20775) from_json should also have an API where the schema is specified with a string

2017-05-18 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-20775:
-
Issue Type: Improvement  (was: Bug)

> from_json should also have an API where the schema is specified with a string
> -
>
> Key: SPARK-20775
> URL: https://issues.apache.org/jira/browse/SPARK-20775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>
> Right now you also have to provide a java.util.Map which is not nice for 
> Scala users.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20775) from_json should also have an API where the schema is specified with a string

2017-05-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015819#comment-16015819
 ] 

Takeshi Yamamuro commented on SPARK-20775:
--

Since I feel this is not a bug, I'll change the issue type.

> from_json should also have an API where the schema is specified with a string
> -
>
> Key: SPARK-20775
> URL: https://issues.apache.org/jira/browse/SPARK-20775
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Burak Yavuz
>
> Right now you also have to provide a java.util.Map which is not nice for 
> Scala users.
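
For contrast, a hedged sketch of the Scala-friendly path that exists today, 
building a StructType by hand; the df variable and its "json" column are 
assumptions for the example.

{code}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("a", IntegerType).add("b", StringType)
val parsed = df.select(from_json(col("json"), schema).as("parsed"))  // df is assumed to hold a "json" string column
{code}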



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20797) mllib lda's LocalLDAModel's save: out of memory.

2017-05-18 Thread d0evi1 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

d0evi1 updated SPARK-20797:
---
Summary: mllib lda's LocalLDAModel's save: out of memory.   (was: mllib lda 
load and save out of memory. )

> mllib lda's LocalLDAModel's save: out of memory. 
> -
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model with large text data (nearly 1 billion 
> Chinese news abstracts), the training step goes well, but the save step 
> fails. Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized model is bigger than spark.kryoserializer.buffer.max 
> (raising that param fixes problem 1, but leads to problem 2).
> Problem 2: it exceeds spark.akka.frameSize (raising this param too far fails 
> with out of memory and the job gets killed; on versions > 2.0.0 the error is 
> "exceeds max allowed: spark.rpc.message.maxSize").
> This appears when the topic count is large (k=200 is OK, but k=300 fails) and 
> the vocabulary size is also large (nearly 1,000,000).
> I found that Word2Vec's save function is similar to LocalLDAModel's save 
> function:
> Word2Vec's problem (saving with repartition(1)) has been fixed in 
> [https://github.com/apache/spark/pull/9989], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition when saving.
> Word2Vec's save method from the latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save 
> method, recompiled the code, and deployed the new LDA job with large data on 
> our cluster; it works.
> I hope this will be fixed in the next version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20797) mllib lda load and save out of memory.

2017-05-18 Thread d0evi1 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

d0evi1 updated SPARK-20797:
---
Description: 
When I train an online LDA model with large text data (nearly 1 billion Chinese 
news abstracts), the training step goes well, but the save step fails. Something 
like the following happens (e.g. on 1.6.1):

Problem 1: the serialized model is bigger than spark.kryoserializer.buffer.max 
(raising that param fixes problem 1, but leads to problem 2).
Problem 2: it exceeds spark.akka.frameSize (raising this param too far fails 
with out of memory and the job gets killed; on versions > 2.0.0 the error is 
"exceeds max allowed: spark.rpc.message.maxSize").

This appears when the topic count is large (k=200 is OK, but k=300 fails) and 
the vocabulary size is also large (nearly 1,000,000).

I found that Word2Vec's save function is similar to LocalLDAModel's save 
function:

Word2Vec's problem (saving with repartition(1)) has been fixed in 
[https://github.com/apache/spark/pull/9989], but LocalLDAModel still uses 
repartition(1), i.e. a single partition when saving.

Word2Vec's save method from the latest code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:

  val approxSize = (4L * vectorSize + 15) * numWords
  val nPartitions = ((approxSize / bufferSize) + 1).toInt
  val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
  
spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))


but the code in mllib.clustering.LDAModel's LocalLDAModel's save:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala

you'll see:

  val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
  val topics = Range(0, k).map { topicInd =>
Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd)
  }
  
spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))


Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with 
the topic count k and used repartition(nPartitions) in LocalLDAModel's save 
method, recompiled the code, and deployed the new LDA job with large data on our 
cluster; it works.

I hope this will be fixed in the next version.

  was:
when i try online lda model with large text data, the training step went well, 
but the save step failed. but  something like below happened (etc. 1.6.1):

1.bigger than spark.kryoserializer.buffer.max.  (turning bigger the param can 
fixed),
2. exceed spark.akka.frameSize. (turning this param too bigger will fail, 
version > 2.0.0, exceeds max allowed: spark.rpc.message.maxSize).

when topics  num is large, and vocab size is large too. this problem will 
appear.


so i found this:

https://github.com/apache/spark/pull/9989, word2vec's problem has been fixed, 

this is word2vec's  save method from latest code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:

  val approxSize = (4L * vectorSize + 15) * numWords
  val nPartitions = ((approxSize / bufferSize) + 1).toInt
  val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
  
spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))


but the code in mllib.clustering.LDAModel's save:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala

you'll see:

  val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
  val topics = Range(0, k).map { topicInd =>
Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), topicInd)
  }
  
spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))


i try word2vec's save, replace numWords to topic K, repartition(nPartitions), 
recompile the code, deploy the new lda's project with large data on our machine 
cluster, it works.

hopes it will fixed in the next version.


> mllib lda load and save out of memory. 
> ---
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model with large text data (nearly 1 billion 
> Chinese news abstracts), the training step goes well, but the save step 
> fails. Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized model is bigger than spark.kryoserializer.buffer.max 
> (raising that param fixes problem 1, but leads to problem 2).
> Problem 2: it exceeds spark.akka.frameSize (raising this param too far fails 
> with out of memory and the job gets killed; on versions > 2.0.0 the error is 
> "exceeds max allowed: spark.rpc.message.maxSize")

[jira] [Commented] (SPARK-20797) mllib lda load and save out of memory.

2017-05-18 Thread d0evi1 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015813#comment-16015813
 ] 

d0evi1 commented on SPARK-20797:


Sorry for my poor English. I have rewritten the problem description; it now 
covers just one topic: MLlib's LocalLDAModel's save.

> mllib lda load and save out of memory. 
> ---
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model with large text data (nearly 1 billion 
> Chinese news abstracts), the training step goes well, but the save step 
> fails. Something like the following happens (e.g. on 1.6.1):
> Problem 1: the serialized model is bigger than spark.kryoserializer.buffer.max 
> (raising that param fixes problem 1, but leads to problem 2).
> Problem 2: it exceeds spark.akka.frameSize (raising this param too far fails 
> with out of memory and the job gets killed; on versions > 2.0.0 the error is 
> "exceeds max allowed: spark.rpc.message.maxSize").
> This appears when the topic count is large (k=200 is OK, but k=300 fails) and 
> the vocabulary size is also large (nearly 1,000,000).
> I found that Word2Vec's save function is similar to LocalLDAModel's save 
> function:
> Word2Vec's problem (saving with repartition(1)) has been fixed in 
> [https://github.com/apache/spark/pull/9989], but LocalLDAModel still uses 
> repartition(1), i.e. a single partition when saving.
> Word2Vec's save method from the latest code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   
> spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> but the code in mllib.clustering.LDAModel's LocalLDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
> Data(Vectors.dense((topicsDenseMatrix(::, topicInd).toArray)), 
> topicInd)
>   }
>   
> spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> Following Word2Vec's save (repartition(nPartitions)), I replaced numWords with 
> the topic count k and used repartition(nPartitions) in LocalLDAModel's save 
> method, recompiled the code, and deployed the new LDA job with large data on 
> our cluster; it works.
> I hope this will be fixed in the next version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20782) Dataset's isCached operator

2017-05-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015810#comment-16015810
 ] 

Takeshi Yamamuro commented on SPARK-20782:
--

This short-cut makes some sense to me. WDYT? cc: [~smilegator][~cloud_fan]

> Dataset's isCached operator
> ---
>
> Key: SPARK-20782
> URL: https://issues.apache.org/jira/browse/SPARK-20782
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It'd be very convenient to have {{isCached}} operator that would say whether 
> a query is cached in-memory or not.
> It'd be as simple as the following snippet:
> {code}
> // val q2: DataFrame
> spark.sharedState.cacheManager.lookupCachedData(q2.queryExecution.logical).isDefined
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20800) Allow users to set job group when connecting through the SQL thrift server

2017-05-18 Thread Tim Zeyl (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Zeyl updated SPARK-20800:
-
Description: 
It would be useful for users to be able to set the job group through thrift 
server clients like beeline so that jobs in the event log could be grouped 
together logically. This would help in tracking the performance of repeated 
runs of similar sql queries, which could be tagged by the user with the same 
job group id. Currently each sql query, and corresponding job, is assigned a 
random UUID as the job group.

Ideally users could set the job group in two ways:
1. by issuing a sql command prior to their query (for example, SET 
spark.sql.thriftserver.jobGroupID=jobA)
2. by passing a hive conf parameter through beeline to set the job group for 
the session.

Alternatively, if people think the job group needs to be a random UUID for each 
sql query, introducing another parameter that could be written into the job 
properties field of the event log would be helpful for tracking the performance 
of repeated runs.

  was:
It would be useful for users to be able to set the job group through thrift 
server clients like beeline so that jobs in the event log could be grouped 
together logically. This would help in tracking the performance of repeated 
runs of similar sql queries, which could be tagged by the user with the same 
job group id. Currently each sql query, and corresponding job, is assigned a 
random UUID as the job group.

Ideally users could set the job group in two ways:
1. by issuing a sql command prior to their query (for example, SET 
spark.sql.thriftserver.jobGroupID=jobA;)
2. by passing a hive conf parameter through beeline to set the job group for 
the session.

Alternatively, if people think the job group needs to be a random UUID for each 
sql query, introducing another parameter that could be written into the job 
properties field of the event log would be helpful for tracking the performance 
of repeated runs.


> Allow users to set job group when connecting through the SQL thrift server
> --
>
> Key: SPARK-20800
> URL: https://issues.apache.org/jira/browse/SPARK-20800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tim Zeyl
>Priority: Minor
>
> It would be useful for users to be able to set the job group through thrift 
> server clients like beeline so that jobs in the event log could be grouped 
> together logically. This would help in tracking the performance of repeated 
> runs of similar sql queries, which could be tagged by the user with the same 
> job group id. Currently each sql query, and corresponding job, is assigned a 
> random UUID as the job group.
> Ideally users could set the job group in two ways:
> 1. by issuing a sql command prior to their query (for example, SET 
> spark.sql.thriftserver.jobGroupID=jobA)
> 2. by passing a hive conf parameter through beeline to set the job group for 
> the session.
> Alternatively, if people think the job group needs to be a random UUID for 
> each sql query, introducing another parameter that could be written into the 
> job properties field of the event log would be helpful for tracking the 
> performance of repeated runs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20747) Distinct in Aggregate Functions

2017-05-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015776#comment-16015776
 ] 

Takeshi Yamamuro commented on SPARK-20747:
--

You mean this query below?
{code}
scala> Seq((1, 1), (1, 1), (1, 1)).toDF("a", "b").createOrReplaceTempView("t")

scala> sql("""select a, avg(distinct b) from t group by a""").show
+---+---+   
|  a|avg(DISTINCT b)|
+---+---+
|  1|1.0|
+---+---+
{code}
It seems these syntaxes are already supported, though; am I missing something?

> Distinct in Aggregate Functions
> ---
>
> Key: SPARK-20747
> URL: https://issues.apache.org/jira/browse/SPARK-20747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> AVG ( [DISTINCT] | [ALL] expression )
> MAX ( [DISTINCT] | [ALL] expression )
> MIN ( [DISTINCT] | [ALL] expression )
> SUM ( [DISTINCT] | [ALL] expression )
> {noformat}
> Except COUNT, the DISTINCT clause is not supported by Spark SQL



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20800) Allow users to set job group when connecting through the SQL thrift server

2017-05-18 Thread Tim Zeyl (JIRA)
Tim Zeyl created SPARK-20800:


 Summary: Allow users to set job group when connecting through the 
SQL thrift server
 Key: SPARK-20800
 URL: https://issues.apache.org/jira/browse/SPARK-20800
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Tim Zeyl
Priority: Minor


It would be useful for users to be able to set the job group through thrift 
server clients like beeline so that jobs in the event log could be grouped 
together logically. This would help in tracking the performance of repeated 
runs of similar sql queries, which could be tagged by the user with the same 
job group id. Currently each sql query, and corresponding job, is assigned a 
random UUID as the job group.

Ideally users could set the job group in two ways:
1. by issuing a sql command prior to their query (for example, SET 
spark.sql.thriftserver.jobGroupID=jobA;)
2. by passing a hive conf parameter through beeline to set the job group for 
the session.

Alternatively, if people think the job group needs to be a random UUID for each 
sql query, introducing another parameter that could be written into the job 
properties field of the event log would be helpful for tracking the performance 
of repeated runs.
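
A minimal sketch of the underlying Scala API such a feature would surface 
through the thrift server; "jobA" and "some_table" are illustrative names, not 
part of the proposal.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("job-group-demo").getOrCreate()

spark.sparkContext.setJobGroup("jobA", "repeated nightly query", interruptOnCancel = false)
spark.sql("SELECT count(*) FROM some_table").show()   // jobs now carry group id "jobA" in the event log
spark.sparkContext.clearJobGroup()
{code}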



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772
 ] 

Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM:
-

I'm seeing a regression from this change, the last `del <> 'hi'` filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}


was (Author: masterddt):
I'm seeing a regression from this change, the last {del <> 'hi'} filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given 
> column name in Dataset.dropDuplicates. When there is more than one column 
> with the same name, the other columns are put into aggregation columns instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772
 ] 

Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM:
-

I'm seeing a regression from this change, the last "del <> 'hi'" filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}


was (Author: masterddt):
I'm seeing a regression from this change, the last `del <> 'hi'` filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given 
> column name in Dataset.dropDuplicates. When there is more than one column 
> with the same name, the other columns are put into aggregation columns instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772
 ] 

Mitesh edited comment on SPARK-17867 at 5/18/17 1:48 PM:
-

I'm seeing a regression from this change, the last {del <> 'hi'} filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}


was (Author: masterddt):
I'm seeing a regression from this change, the last filter gets pushed down past 
the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given 
> column name in Dataset.dropDuplicates. When there is more than one column 
> with the same name, the other columns are put into aggregation columns instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772
 ] 

Mitesh edited comment on SPARK-17867 at 5/18/17 1:49 PM:
-

I'm seeing a regression from this change, the last "del <> 'hi'" filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}


was (Author: masterddt):
I'm seeing a regression from this change, the last "del <> 'hi'" filter gets 
pushed down past the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given 
> column name in Dataset.dropDuplicates. When there is more than one column 
> with the same name, the other columns are put into aggregation columns instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772
 ] 

Mitesh commented on SPARK-17867:


I'm seeing a regression from this change, the last filter gets pushed down past 
the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:scala}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given 
> column name in Dataset.dropDuplicates. When there is more than one column 
> with the same name, the other columns are put into aggregation columns instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17867) Dataset.dropDuplicates (i.e. distinct) should consider the columns with same column name

2017-05-18 Thread Mitesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015772#comment-16015772
 ] 

Mitesh edited comment on SPARK-17867 at 5/18/17 1:47 PM:
-

I'm seeing a regression from this change, the last filter gets pushed down past 
the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:none}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}


was (Author: masterddt):
I'm seeing a regression from this change, the last filter gets pushed down past 
the dropDuplicates aggregation. cc [~cloud_fan]
 
{code:scala}
val df = Seq((1,2,3,"hi"), (1,2,4,"hi"))
  .toDF("userid", "eventid", "vk", "del")
  .filter("userid is not null and eventid is not null and vk is not null")
  .repartitionByColumns(Seq("userid"))
  .sortWithinPartitions(asc("userid"), asc("eventid"), desc("vk"))
  .dropDuplicates("eventid")
  .filter("userid is not null")
  .repartitionByColumns(Seq("userid")).
  sortWithinPartitions(asc("userid"))
  .filter("del <> 'hi'")

// filter should not be pushed down to the local table scan
df.queryExecution.sparkPlan.collect {
  case f @ FilterExec(_, t @ LocalTableScanExec(_, _)) =>
    assert(false, s"$f was pushed down to $t")
}
{code}

> Dataset.dropDuplicates (i.e. distinct) should consider the columns with same 
> column name
> 
>
> Key: SPARK-17867
> URL: https://issues.apache.org/jira/browse/SPARK-17867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We find and get the first resolved attribute from the output with the given 
> column name in Dataset.dropDuplicates. When there is more than one column 
> with the same name, the other columns are put into aggregation columns instead 
> of grouping columns. We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-18 Thread Ala Luszczak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ala Luszczak updated SPARK-20798:
-
Description: 
GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
that one should first make sure the value is not null before calling the 
getter. This can lead to errors.

An example of generated code:
{noformat}
/* 059 */ final UTF8String fieldName = value.getUTF8String(0);
/* 060 */ if (value.isNullAt(0)) {
/* 061 */   rowWriter1.setNullAt(0);
/* 062 */ } else {
/* 063 */   rowWriter1.write(0, fieldName);
/* 064 */ }
{noformat}


  was:
GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
that one should first make sure the value is not null before calling the getter.

An example of generated code:
{noformat}
/* 059 */ final UTF8String fieldName = value.getUTF8String(0);
/* 060 */ if (value.isNullAt(0)) {
/* 061 */   rowWriter1.setNullAt(0);
/* 062 */ } else {
/* 063 */   rowWriter1.write(0, fieldName);
/* 064 */ }
{noformat}



> GenerateUnsafeProjection should check if value is null before calling the 
> getter
> 
>
> Key: SPARK-20798
> URL: https://issues.apache.org/jira/browse/SPARK-20798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>
> GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
> that one should first make sure the value is not null before calling the 
> getter. This can lead to errors.
> An example of generated code:
> {noformat}
> /* 059 */ final UTF8String fieldName = value.getUTF8String(0);
> /* 060 */ if (value.isNullAt(0)) {
> /* 061 */   rowWriter1.setNullAt(0);
> /* 062 */ } else {
> /* 063 */   rowWriter1.write(0, fieldName);
> /* 064 */ }
> {noformat}
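
For clarity, the fix amounts to reordering these same generated lines so the getter is 
only invoked on the non-null branch, along the lines of (illustrative, not the literal 
output of the patch):
{noformat}
/* 059 */ if (value.isNullAt(0)) {
/* 060 */   rowWriter1.setNullAt(0);
/* 061 */ } else {
/* 062 */   final UTF8String fieldName = value.getUTF8String(0);
/* 063 */   rowWriter1.write(0, fieldName);
/* 064 */ }
{noformat}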



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20798:


Assignee: Apache Spark

> GenerateUnsafeProjection should check if value is null before calling the 
> getter
> 
>
> Key: SPARK-20798
> URL: https://issues.apache.org/jira/browse/SPARK-20798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>Assignee: Apache Spark
>
> GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
> that one should first make sure the value is not null before calling the 
> getter.
> An example of generated code:
> {noformat}
> /* 059 */ final UTF8String fieldName = value.getUTF8String(0);
> /* 060 */ if (value.isNullAt(0)) {
> /* 061 */   rowWriter1.setNullAt(0);
> /* 062 */ } else {
> /* 063 */   rowWriter1.write(0, fieldName);
> /* 064 */ }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20798:


Assignee: (was: Apache Spark)

> GenerateUnsafeProjection should check if value is null before calling the 
> getter
> 
>
> Key: SPARK-20798
> URL: https://issues.apache.org/jira/browse/SPARK-20798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>
> GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
> that one should first make sure the value is not null before calling the 
> getter.
> An example of generated code:
> {noformat}
> /* 059 */ final UTF8String fieldName = value.getUTF8String(0);
> /* 060 */ if (value.isNullAt(0)) {
> /* 061 */   rowWriter1.setNullAt(0);
> /* 062 */ } else {
> /* 063 */   rowWriter1.write(0, fieldName);
> /* 064 */ }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015754#comment-16015754
 ] 

Apache Spark commented on SPARK-20798:
--

User 'ala' has created a pull request for this issue:
https://github.com/apache/spark/pull/18030

> GenerateUnsafeProjection should check if value is null before calling the 
> getter
> 
>
> Key: SPARK-20798
> URL: https://issues.apache.org/jira/browse/SPARK-20798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>
> GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
> that one should first make sure the value is not null before calling the 
> getter.
> An example of generated code:
> {noformat}
> /* 059 */ final UTF8String fieldName = value.getUTF8String(0);
> /* 060 */ if (value.isNullAt(0)) {
> /* 061 */   rowWriter1.setNullAt(0);
> /* 062 */ } else {
> /* 063 */   rowWriter1.write(0, fieldName);
> /* 064 */ }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20799) Unable to infer schema for ORC on reading ORC from S3

2017-05-18 Thread Jork Zijlstra (JIRA)
Jork Zijlstra created SPARK-20799:
-

 Summary: Unable to infer schema for ORC on reading ORC from S3
 Key: SPARK-20799
 URL: https://issues.apache.org/jira/browse/SPARK-20799
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Jork Zijlstra


We are getting the following exception: 
{code}org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. 
It must be specified manually.{code}

Combining the following factors will cause it:
- Use S3
- Use the ORC format
- Don't apply partitioning to the data
- Embed AWS credentials in the path

The problem is in PartitioningAwareFileIndex's def allFiles():

{code}
leafDirToChildrenFiles.get(qualifiedPath)
  .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
  .getOrElse(Array.empty)
{code}

leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while the 
qualifiedPath contains the path WITH credentials.
So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data 
is read, and the schema cannot be inferred.


Spark does log "S3xLoginHelper:90 - The Filesystem URI contains login 
details. This is insecure and may be unsupported in future.", but that should 
not mean it stops working.

Workaround:
Move the AWS credentials from the path to the SparkSession
{code}
// awsAccessKeyId / awsSecretAccessKey hold the credentials that used to be in the path
val spark = SparkSession.builder
  .config("spark.hadoop.fs.s3n.awsAccessKeyId", awsAccessKeyId)
  .config("spark.hadoop.fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
  .getOrCreate()
{code}
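
With the credentials supplied via the Hadoop configuration, the path itself no longer 
needs to embed them (bucket and prefix below are placeholders):
{code}
val df = spark.read.orc("s3n://my-bucket/path/to/orc")
df.printSchema()
{code}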



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20168:


Assignee: Apache Spark

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Apache Spark
>  Labels: kinesis, streaming
>
> Kinesis client can resume from a specified timestamp while creating a stream. 
> We should have an option to pass a timestamp in the config to allow Kinesis to 
> resume from the given timestamp.
> Have started initial work and will be posting a PR after I test the patch -
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8
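
For context, a minimal sketch of how a Kinesis DStream is created today with the existing 
initial positions; app name, stream name, endpoint and region are placeholders, and the 
exact way the timestamp (KCL's AT_TIMESTAMP) gets passed through config should be treated 
as pending the linked work in progress:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-demo"), Seconds(10))

// Today only LATEST / TRIM_HORIZON are usefully exposed through this argument;
// the proposal is to also honour a timestamp-based start supplied via config.
val stream = KinesisUtils.createStream(
  ssc, "myKinesisApp", "myStream", "https://kinesis.us-east-1.amazonaws.com",
  "us-east-1", InitialPositionInStream.TRIM_HORIZON, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2)
{code}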



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20168:


Assignee: (was: Apache Spark)

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, streaming
>
> Kinesis client can resume from a specified timestamp while creating a stream. 
> We should have an option to pass a timestamp in the config to allow Kinesis to 
> resume from the given timestamp.
> Have started initial work and will be posting a PR after I test the patch -
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2017-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015706#comment-16015706
 ] 

Apache Spark commented on SPARK-20168:
--

User 'yssharma' has created a pull request for this issue:
https://github.com/apache/spark/pull/18029

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>  Labels: kinesis, streaming
>
> Kinesis client can resume from a specified timestamp while creating a stream. 
> We should have an option to pass a timestamp in the config to allow Kinesis to 
> resume from the given timestamp.
> Have started initial work and will be posting a PR after I test the patch -
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14864) [MLLIB] Implement Doc2Vec

2017-05-18 Thread Rajdeep Mondal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015693#comment-16015693
 ] 

Rajdeep Mondal commented on SPARK-14864:


Sorry to bother. Any progress on this?

> [MLLIB] Implement Doc2Vec
> -
>
> Key: SPARK-14864
> URL: https://issues.apache.org/jira/browse/SPARK-14864
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Peter Mountanos
>Priority: Minor
>
> It would be useful to implement Doc2Vec, as described in the paper 
> [Distributed Representations of Sentences and 
> Documents|https://cs.stanford.edu/~quocle/paragraph_vector.pdf]. Gensim has 
> an implementation [Deep learning with 
> paragraph2vec|https://radimrehurek.com/gensim/models/doc2vec.html]. 
> Le & Mikolov show that when aggregating Word2Vec vector representations for a 
> paragraph/document, it does not perform well for prediction tasks. Instead, 
> they propose the Paragraph Vector implementation, which provides 
> state-of-the-art results on several text classification and sentiment 
> analysis tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20797) mllib lda load and save out of memory.

2017-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015657#comment-16015657
 ] 

Sean Owen commented on SPARK-20797:
---

It's not clear what you're describing here. Can you reduce this to focus on the 
specific problem and change?
How many topics?

> mllib lda load and save out of memory. 
> ---
>
> Key: SPARK-20797
> URL: https://issues.apache.org/jira/browse/SPARK-20797
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1, 1.6.3, 2.0.0, 2.0.2, 2.1.1
>Reporter: d0evi1
>
> When I train an online LDA model on large text data, the training step goes 
> well, but the save step fails with something like the following (e.g. on 
> 1.6.1):
> 1. the payload is bigger than spark.kryoserializer.buffer.max (raising the 
> param can fix it),
> 2. the payload exceeds spark.akka.frameSize (raising this param too far also 
> fails; on versions > 2.0.0 the error is "exceeds max allowed: 
> spark.rpc.message.maxSize").
> The problem appears when the number of topics is large and the vocabulary 
> size is large too.
> So I found this:
> https://github.com/apache/spark/pull/9989, where word2vec's version of the 
> problem has already been fixed. This is word2vec's save method in the latest 
> code:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:
>   val approxSize = (4L * vectorSize + 15) * numWords
>   val nPartitions = ((approxSize / bufferSize) + 1).toInt
>   val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
>   spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))
> But in mllib.clustering.LDAModel's save:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala
> you'll see:
>   val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
>   val topics = Range(0, k).map { topicInd =>
>     Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
>   }
>   spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))
> I tried word2vec's approach in the LDA save: replace numWords with the topic 
> count k, use repartition(nPartitions), recompile, and deploy the patched LDA 
> with large data on our cluster, and it works.
> I hope this will be fixed in the next version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-18 Thread Ala Luszczak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ala Luszczak updated SPARK-20798:
-
Description: 
GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
that one should first make sure the value is not null before calling the getter.

An example of generated code:
{noformat}
/* 059 */ final UTF8String fieldName = value.getUTF8String(0);
/* 060 */ if (value.isNullAt(0)) {
/* 061 */   rowWriter1.setNullAt(0);
/* 062 */ } else {
/* 063 */   rowWriter1.write(0, fieldName);
/* 064 */ }
{noformat}


> GenerateUnsafeProjection should check if value is null before calling the 
> getter
> 
>
> Key: SPARK-20798
> URL: https://issues.apache.org/jira/browse/SPARK-20798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ala Luszczak
>
> GenerateUnsafeProjection.writeStructToBuffer() does not honor the assumption 
> that one should first make sure the value is not null before calling the 
> getter.
> An example of generated code:
> {noformat}
> /* 059 */ final UTF8String fieldName = value.getUTF8String(0);
> /* 060 */ if (value.isNullAt(0)) {
> /* 061 */   rowWriter1.setNullAt(0);
> /* 062 */ } else {
> /* 063 */   rowWriter1.write(0, fieldName);
> /* 064 */ }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20796:
--
Priority: Trivial  (was: Major)

[~liuzhaokun] please don't open a JIRA for these. They're trivial. They are not 
"major improvements" as you've tagged it. Read 
http://spark.apache.org/contributing.html

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Priority: Trivial
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20798) GenerateUnsafeProjection should check if value is null before calling the getter

2017-05-18 Thread Ala Luszczak (JIRA)
Ala Luszczak created SPARK-20798:


 Summary: GenerateUnsafeProjection should check if value is null 
before calling the getter
 Key: SPARK-20798
 URL: https://issues.apache.org/jira/browse/SPARK-20798
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Ala Luszczak






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20797) mllib lda load and save out of memory.

2017-05-18 Thread d0evi1 (JIRA)
d0evi1 created SPARK-20797:
--

 Summary: mllib lda load and save out of memory. 
 Key: SPARK-20797
 URL: https://issues.apache.org/jira/browse/SPARK-20797
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.1, 2.0.2, 2.0.0, 1.6.3, 1.6.1
Reporter: d0evi1


When I train an online LDA model on large text data, the training step goes well, 
but the save step fails with something like the following (e.g. on 1.6.1):

1. the payload is bigger than spark.kryoserializer.buffer.max (raising the param 
can fix it),
2. the payload exceeds spark.akka.frameSize (raising this param too far also 
fails; on versions > 2.0.0 the error is "exceeds max allowed: 
spark.rpc.message.maxSize").

The problem appears when the number of topics is large and the vocabulary size 
is large too.


So I found this:

https://github.com/apache/spark/pull/9989, where word2vec's version of the 
problem has already been fixed.

This is word2vec's save method in the latest code:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala:

  val approxSize = (4L * vectorSize + 15) * numWords
  val nPartitions = ((approxSize / bufferSize) + 1).toInt
  val dataArray = model.toSeq.map { case (w, v) => Data(w, v) }
  spark.createDataFrame(dataArray).repartition(nPartitions).write.parquet(Loader.dataPath(path))


But in mllib.clustering.LDAModel's save:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala

you'll see:

  val topicsDenseMatrix = topicsMatrix.asBreeze.toDenseMatrix
  val topics = Range(0, k).map { topicInd =>
    Data(Vectors.dense(topicsDenseMatrix(::, topicInd).toArray), topicInd)
  }
  spark.createDataFrame(topics).repartition(1).write.parquet(Loader.dataPath(path))


I tried word2vec's approach in the LDA save: replace numWords with the topic 
count k, use repartition(nPartitions), recompile, and deploy the patched LDA with 
large data on our cluster, and it works.

I hope this will be fixed in the next version.
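
A rough, self-contained sketch of the proposed change outside the private save path, 
assuming the topic matrix is available as an org.apache.spark.mllib.linalg.Matrix; 
TopicData, saveTopics and the 64m default buffer size are illustrative names/values, 
not the actual private API:

{code}
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.sql.SparkSession

case class TopicData(topic: Vector, index: Int)

// Write one row per topic, sizing the partition count from the approximate
// serialized size instead of forcing everything into a single partition.
def saveTopics(spark: SparkSession, topicsMatrix: Matrix, path: String,
               bufferSize: Long = 64L * 1024 * 1024): Unit = {
  val vocabSize = topicsMatrix.numRows
  val k = topicsMatrix.numCols
  val approxSize = 8L * vocabSize * k                 // doubles in the matrix
  val nPartitions = ((approxSize / bufferSize) + 1).toInt
  val arr = topicsMatrix.toArray                      // column-major: vocabSize x k
  val topics = (0 until k).map { t =>
    TopicData(Vectors.dense(arr.slice(t * vocabSize, (t + 1) * vocabSize)), t)
  }
  spark.createDataFrame(topics).repartition(nPartitions).write.parquet(path)
}
{code}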



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20796:


Assignee: Apache Spark

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>Assignee: Apache Spark
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20796:


Assignee: (was: Apache Spark)

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015631#comment-16015631
 ] 

Apache Spark commented on SPARK-20796:
--

User 'liu-zhaokun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18027

> the location of start-master.sh in spark-standalone.md is wrong
> ---
>
> Key: SPARK-20796
> URL: https://issues.apache.org/jira/browse/SPARK-20796
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.1
>Reporter: liuzhaokun
>
> the location of start-master.sh in spark-standalone.md should be 
> "sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20796) the location of start-master.sh in spark-standalone.md is wrong

2017-05-18 Thread liuzhaokun (JIRA)
liuzhaokun created SPARK-20796:
--

 Summary: the location of start-master.sh in spark-standalone.md is 
wrong
 Key: SPARK-20796
 URL: https://issues.apache.org/jira/browse/SPARK-20796
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.1.1
Reporter: liuzhaokun


the location of start-master.sh in spark-standalone.md should be 
"sbin/start-master.sh" rather than "bin/start-master.sh".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-05-18 Thread Aaquib Khwaja (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015591#comment-16015591
 ] 

Aaquib Khwaja commented on SPARK-19275:
---

Hi [~dmitry_iii], I also ran into a similar issue.
I've set the value of 'spark.streaming.kafka.consumer.poll.ms' to 6, but 
I'm still running into issues.
Here is the stack trace and other details: 
http://stackoverflow.com/questions/44045323/sparkstreamingkafka-failed-to-get-records-after-polling-for-6

Also, below are some relevant configs:
batch.interval = 60s
spark.streaming.kafka.consumer.poll.ms = 6
session.timeout.ms = 6 (default: 3)
heartbeat.interval.ms = 6000 (default: 3000)
request.timeout.ms = 9 (default: 4)

Any help would be great!
Thanks.
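
For reference, a minimal sketch of where each of these knobs lives, assuming the Kafka 
0.10 direct stream integration; the values below are placeholders, not the settings 
quoted above:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.kafka.consumer.poll.ms", "60000")  // Spark-side poll timeout

// Kafka consumer settings go into the kafkaParams map passed to
// KafkaUtils.createDirectStream / ConsumerStrategies.Subscribe.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"     -> "broker1:9092",
  "session.timeout.ms"    -> "60000",
  "heartbeat.interval.ms" -> "6000",
  "request.timeout.ms"    -> "90000"
)
{code}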

> Spark Streaming, Kafka receiver, "Failed to get records for ... after polling 
> for 512"
> --
>
> Key: SPARK-19275
> URL: https://issues.apache.org/jira/browse/SPARK-19275
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11
>Reporter: Dmitry Ochnev
>
> We have a Spark Streaming application reading records from Kafka 0.10.
> Some tasks are failed because of the following error:
> "java.lang.AssertionError: assertion failed: Failed to get records for (...) 
> after polling for 512"
> The first attempt fails and the second attempt (retry) completes 
> successfully; this is the pattern that we see for many tasks in our logs. 
> These failures and retries consume resources.
> A similar case with a stack trace is described here:
> https://www.mail-archive.com/user@spark.apache.org/msg56564.html
> https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767
> Here is the line from the stack trace where the error is raised:
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
> We tried several values for "spark.streaming.kafka.consumer.poll.ms", - 2, 5, 
> 10, 30 and 60 seconds, but the error appeared in all the cases except the 
> last one. Moreover, increasing the threshold led to increasing total Spark 
> stage duration.
> In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to 
> fewer task failures, but at the cost of longer total stage duration. So it is 
> bad for performance when processing data streams.
> We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other 
> related classes) which inhibits the reading process.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20795) Maximum document frequency for IDF

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20795.
---
Resolution: Invalid

Please start this as a question on the mailing list.

> Maximum document frequency for IDF
> --
>
> Key: SPARK-20795
> URL: https://issues.apache.org/jira/browse/SPARK-20795
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Turan Gojayev
>Priority: Minor
>
> In the current implementation of IDF there is no way to set a maximum number 
> of documents for filtering terms. I assume the functionality would be 
> analogous to minimum document frequency, and was wondering if there is a special 
> reason for not having a maxDocFreq parameter and filtering.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20795) Maximum document frequency for IDF

2017-05-18 Thread Turan Gojayev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015580#comment-16015580
 ] 

Turan Gojayev commented on SPARK-20795:
---

I am a total newbie here, so excuse me if I've set anything wrong 

> Maximum document frequency for IDF
> --
>
> Key: SPARK-20795
> URL: https://issues.apache.org/jira/browse/SPARK-20795
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Turan Gojayev
>Priority: Minor
>
> In the current implementation of IDF there is no way to set a maximum number 
> of documents for filtering terms. I assume the functionality would be 
> analogous to minimum document frequency, and was wondering if there is a special 
> reason for not having a maxDocFreq parameter and filtering.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20795) Maximum document frequency for IDF

2017-05-18 Thread Turan Gojayev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Turan Gojayev updated SPARK-20795:
--
Description: In the current implementation of IDF there is no way to set a 
maximum number of documents for filtering terms. I assume the functionality would 
be analogous to minimum document frequency, and was wondering if there is a 
special reason for not having a maxDocFreq parameter and filtering.
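
For reference, a minimal sketch of the current ml.feature.IDF API and the kind of setter 
being asked for; setMaxDocFreq is hypothetical and does not exist today:
{code}
import org.apache.spark.ml.feature.IDF

// Today only a lower bound on document frequency can be set.
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMinDocFreq(5)

// Requested here (hypothetical, not in the current API):
// val idf2 = new IDF().setMinDocFreq(5).setMaxDocFreq(10000)
{code}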

> Maximum document frequency for IDF
> --
>
> Key: SPARK-20795
> URL: https://issues.apache.org/jira/browse/SPARK-20795
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Turan Gojayev
>Priority: Minor
>
> In the current implementation of IDF there is no way to set a maximum number 
> of documents for filtering terms. I assume the functionality would be 
> analogous to minimum document frequency, and was wondering if there is a special 
> reason for not having a maxDocFreq parameter and filtering.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20795) Maximum document frequency for IDF

2017-05-18 Thread Turan Gojayev (JIRA)
Turan Gojayev created SPARK-20795:
-

 Summary: Maximum document frequency for IDF
 Key: SPARK-20795
 URL: https://issues.apache.org/jira/browse/SPARK-20795
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.1.0
Reporter: Turan Gojayev
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20784) Spark hangs (v2.0) or Futures timed out (v2.1) after a joinWith() and cache() in YARN client mode

2017-05-18 Thread Mathieu D (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathieu D updated SPARK-20784:
--
Affects Version/s: 2.1.1
  Description: 
Spark hangs and stops executing any job or task (v2.0.2).
The Web UI shows *0 active stages* and *0 active tasks* on the executors, although a 
driver thread is clearly working on/finishing a stage (see below).

Our application runs several Spark contexts for several users in parallel, in 
threads (Spark 2.0.2, yarn-client).

Extract of the thread stack below.

{noformat}
"ForkJoinPool-1-worker-0" #107 daemon prio=5 os_prio=0 tid=0x7fddf0005800 
nid=0x484 waiting on condition [0x7fddd0bf
6000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00078c232760> (a 
scala.concurrent.impl.Promise$CompletionLatch)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at 
org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:212)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:120)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:229)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:125)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:124)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at 
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:30)
at 
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:62)
at 
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at 
org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218)
at 
org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218)
at 
org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78)
at 
org.apache.spark.sql.execution.ProjectExe

[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015455#comment-16015455
 ] 

Nick Pentreath commented on SPARK-20768:


Sure - though perhaps [~yuhaoyan] can give an opinion whether it should be 
added as an explicit {{Param}}?

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015448#comment-16015448
 ] 

Yan Facai (颜发才) edited comment on SPARK-20768 at 5/18/17 8:59 AM:
--

It seems easy,  I can work on it.
However, I'm on holiday this weekend. Is it OK to wait one week?


was (Author: facai):
It seems easy,  I can work on it.

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015448#comment-16015448
 ] 

Yan Facai (颜发才) commented on SPARK-20768:
-

It seems easy,  I can work on it.

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015440#comment-16015440
 ] 

Nick Pentreath commented on SPARK-20768:


It is there, but it is not documented as a {{Param}} and so doesn't show up in the API 
docs; also, there is no {{set}} method for it.
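
For comparison, the Scala side already exposes the setter; the ask here is to mirror this 
in the PySpark wrapper (column name and values below are placeholders):
{code}
import org.apache.spark.ml.fpm.FPGrowth

val fp = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.3)
  .setMinConfidence(0.8)
  .setNumPartitions(8)   // expert param, settable on the Scala/Java API
{code}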

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param

2017-05-18 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015437#comment-16015437
 ] 

Yan Facai (颜发才) commented on SPARK-20768:
-

Hi, I'm a newbie.
`numPartitions` is present in the PySpark code; could you explain in more detail?
Thanks.

```python
def __init__(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
             predictionCol="prediction", numPartitions=None):
```
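
To make the gap concrete, a minimal sketch (assuming Spark 2.2 with {{pyspark.ml.fpm}} available; the local session and toy data are only for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.master("local[2]").appName("fpgrowth-numPartitions").getOrCreate()

# Toy transactions, just enough to fit a model.
df = spark.createDataFrame([(0, ["a", "b"]), (1, ["a", "c"])], ["id", "items"])

# The keyword argument is accepted at construction time (it is in __init__)...
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.5, numPartitions=2)
model = fp.fit(df)

# ...but it is not documented as a Param in the API doc and there is no setter,
# so a call like fp.setNumPartitions(2) is hypothetical here.
```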

> PySpark FPGrowth does not expose numPartitions (expert)  param
> --
>
> Key: SPARK-20768
> URL: https://issues.apache.org/jira/browse/SPARK-20768
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. 
> While it is an "expert" param, the general approach elsewhere is to expose 
> these on the Python side (e.g. {{aggregationDepth}} and intermediate storage 
> params in {{ALS}})



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16202) Misleading Description of CreatableRelationProvider's createRelation

2017-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015386#comment-16015386
 ] 

Apache Spark commented on SPARK-16202:
--

User 'jaceklaskowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/18026

> Misleading Description of CreatableRelationProvider's createRelation
> 
>
> Key: SPARK-16202
> URL: https://issues.apache.org/jira/browse/SPARK-16202
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.1.0
>
>
> The API description of {{createRelation}} in {{CreatableRelationProvider}} is 
> misleading. The current description only expects users to return the 
> relation. However, the major goal of this API should also include saving the 
> Dataframe.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20506) ML, Graph 2.2 QA: Programming guide update and migration guide

2017-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015382#comment-16015382
 ] 

Nick Pentreath commented on SPARK-20506:


Oh also SPARK-14503 is important

> ML, Graph 2.2 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-20506
> URL: https://issues.apache.org/jira/browse/SPARK-20506
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20506) ML, Graph 2.2 QA: Programming guide update and migration guide

2017-05-18 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015362#comment-16015362
 ] 

Nick Pentreath commented on SPARK-20506:


Cool - I've added a section before the Migration Guide in the linked PR. If you 
have any other suggestions for items to include, please mention them here or on 
the PR. Thanks!

> ML, Graph 2.2 QA: Programming guide update and migration guide
> --
>
> Key: SPARK-20506
> URL: https://issues.apache.org/jira/browse/SPARK-20506
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
>
> Before the release, we need to update the MLlib and GraphX Programming 
> Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV

2017-05-18 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015345#comment-16015345
 ] 

Yan Facai (颜发才) commented on SPARK-19581:
-

[~barrybecker4] Hi Becker,
I can't reproduce the crash on spark-2.1.1-bin-hadoop2.7.

1) With a feature vector of size 0, an exception is thrown but it is harmless (no executor crash):

```scala
val data = spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache
import org.apache.spark.ml.classification.NaiveBayes
val model = new NaiveBayes().fit(data)
import org.apache.spark.ml.linalg.{Vectors => SV}
case class TestData(features: org.apache.spark.ml.linalg.Vector)
val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double])
val test = Seq(TestData(emptyVector)).toDF

scala> test.show
+---------+
| features|
+---------+
|(0,[],[])|
+---------+

scala> model.transform(test).show
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
  ... 48 elided
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 692, x: 0
  at scala.Predef$.require(Predef.scala:224)
  ... 99 more
```

2) With an empty feature vector of size 692 (matching the training data), it works fine:

```scala
scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double])
emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[])

scala> val test = Seq(TestData(emptyVector)).toDF
test: org.apache.spark.sql.DataFrame = [features: vector]

scala> test.show
+-----------+
|   features|
+-----------+
|(692,[],[])|
+-----------+

scala> model.transform(test).show
+-----------+--------------------+--------------------+----------+
|   features|       rawPrediction|         probability|prediction|
+-----------+--------------------+--------------------+----------+
|(692,[],[])|[-0.8407831793660...|[0.43137254901960...|       1.0|
+-----------+--------------------+--------------------+----------+

```
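
As an aside, a caller-side guard avoids hitting this path at all when pre-filtering can drop every feature column. A rough PySpark sketch (the helper name and error message are made up; this is not part of Spark):

```python
def safe_transform(model, df, features_col="features"):
    """Illustrative guard: refuse to call transform() on vectors of size 0,
    which is the case that trips the DGEMV error described in this issue."""
    first = df.select(features_col).head()
    if first is None:
        return model.transform(df)  # empty DataFrame, nothing to guard
    if first[0].size == 0:
        raise ValueError(
            "all feature columns were filtered out (vector size 0); "
            "skip prediction or rebuild the feature vector")
    return model.transform(df)
```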

> running NaiveBayes model with 0 features can crash the executor with D 
> rorreGEMV
> 
>
> Key: SPARK-19581
> URL: https://issues.apache.org/jira/browse/SPARK-19581
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
> Environment: spark development or standalone mode on windows or linux.
>Reporter: Barry Becker
>Priority: Minor
>
> The severity of this bug is high (because nothing should cause spark to crash 
> like this) but the priority may be low (because there is an easy workaround).
> In our application, a user can select features and a target to run the 
> NaiveBayes inducer. If columns have too many values or all one value, they 
> will be removed before we call the inducer to create the model. As a result, 
> there are some cases where all the features may get removed. When this 
> happens, executors will crash and get restarted (if on a cluster) or spark 
> will crash and need to be manually restarted (if in development mode).
> It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well 
> when it is encountered. It emits this vague error:
> ** On entry to DGEMV  parameter number  6 had an illegal value
> and terminates.
> My code looks like this:
> {code}
>val predictions = model.transform(testData)  // Make predictions
> // figure out how many were correctly predicted
> val numCorrect = predictions.filter(new Column(actualTarget) === new 
> Column(PREDICTION_LABEL_COLUMN)).count()
> val numIncorrect = testRowCount - numCorrect
> {code}
> The failure is at the line that does the count, but it is not the count that 
> causes the problem; it is the model.transform step (where the model contains 
> the NaiveBayes classifier).
> Here is the stack trace (in development mode):
> {code}
> [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] 
> [akka://JobServer/user/context-supervisor/sql-context] -  done making 
> predictions in 232
>  ** On entry to DGEMV  parameter number  6 had an illegal value
>  ** On entry to DGEMV  parameter number  6 had an illegal value
>  ** On entry to DGEMV  parameter number  6 had an illegal value
> [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505)
> [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] 
> [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29)
> [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] 

[jira] [Resolved] (SPARK-20794) Spark show() command on dataset does not retrieve consistent rows from DASHDB data source

2017-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20794.
---
Resolution: Invalid

It's a question, so it belongs on the mailing list; I think it's a DASHDB 
question. show() is just picking rows from the first partition of the 
underlying data source.
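
For anyone who needs repeatable output, imposing an explicit ordering removes the dependence on partition order. A small PySpark sketch (the helper is illustrative, and the key column is just the one from the repro below):

```python
def show_deterministic(df, key, n=5):
    """Sort by an explicit key before show(), so the first n rows no longer
    depend on which partition happens to be scanned first."""
    df.orderBy(key).show(n)

# e.g. show_deterministic(dashdb, "CUST_ORDER_NUMBER")
```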

> Spark show() command on dataset does not retrieve consistent rows from DASHDB 
> data source
> -
>
> Key: SPARK-20794
> URL: https://issues.apache.org/jira/browse/SPARK-20794
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Sahana HA
>Priority: Minor
>
> When the user creates a dataframe from the DASHDB data source (which is a 
> relational database) and executes df.show(5), it returns different result 
> sets or rows on each execution. We are aware that show(5) will pick the 
> first 5 rows from an existing partition and hence is not guaranteed to be 
> consistent across executions.
> However, when we try the same show(5) command against S3 storage or Bluemix 
> object store (non-relational data sources), we always get the same result 
> sets or rows, in the same order, on each execution.
> We just wanted to confirm why there is a difference between DASHDB and other 
> data sources like S3/Bluemix object store. Is the issue with Spark or with 
> DASHDB alone, or does the inconsistent-rows behavior exist for all relational 
> data sources?
> Repro snippet:
> -- Load the data from dashdb
> val dashdb = 
> sqlContext.read.format("packageName").options(dashdbreadOptions).load
> -- execution #1
> dashdb.show(5)
> +--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
> |        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|   CITY|STATE|      COUNTRY|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
> +--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
> |Personal Accessories|     Eyewear|           107861|Rutland|   VT|United States|     F| 39|       Married|       Sales|
> |   Camping Equipment|    Lanterns|           189003| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
> |   Camping Equipment|Cooking Gear|           107863| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
> |Personal Accessories|     Eyewear|           189005|Villach|   NA|      Austria|     F| 37|       Married|Professional|
> |Personal Accessories|     Eyewear|           107865|Villach|   NA|      Austria|     F| 37|       Married|Professional|
> +--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
> only showing top 5 rows
> -- execution #2
> dashdb.show(5)
> +--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
> |        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|        CITY|STATE|       COUNTRY|GENDER|AGE|MARITAL_STATUS| PROFESSION|
> +--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
> |Mountaineering Eq...|       Tools|           112835|  Portsmouth|   NA|United Kingdom|     M| 24|        Single|      Other|
> |   Camping Equipment|Cooking Gear|           193902|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
> |   Camping Equipment|       Packs|           112837|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
> |Mountaineering Eq...|        Rope|           193904|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
> |      Golf Equipment|     Putters|           112839|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
> +--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
> only showing top 5 rows



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20794) Spark show() command on dataset does not retrieve consistent rows from DASHDB data source

2017-05-18 Thread Sahana HA (JIRA)
Sahana HA created SPARK-20794:
-

 Summary: Spark show() command on dataset does not retrieve 
consistent rows from DASHDB data source
 Key: SPARK-20794
 URL: https://issues.apache.org/jira/browse/SPARK-20794
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Sahana HA
Priority: Minor


When the user creates a dataframe from the DASHDB data source (which is a 
relational database) and executes df.show(5), it returns different result sets 
or rows on each execution. We are aware that show(5) will pick the first 5 
rows from an existing partition and hence is not guaranteed to be consistent 
across executions.

However, when we try the same show(5) command against S3 storage or Bluemix 
object store (non-relational data sources), we always get the same result sets 
or rows, in the same order, on each execution.

We just wanted to confirm why there is a difference between DASHDB and other 
data sources like S3/Bluemix object store. Is the issue with Spark or with 
DASHDB alone, or does the inconsistent-rows behavior exist for all relational 
data sources?

Repro snippet:

-- Load the data from dashdb
val dashdb = 
sqlContext.read.format("packageName").options(dashdbreadOptions).load


-- execution #1

dashdb.show(5)

+--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
|        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|   CITY|STATE|      COUNTRY|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
+--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
|Personal Accessories|     Eyewear|           107861|Rutland|   VT|United States|     F| 39|       Married|       Sales|
|   Camping Equipment|    Lanterns|           189003| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
|   Camping Equipment|Cooking Gear|           107863| Sydney|  NSW|    Australia|     F| 20|        Single| Hospitality|
|Personal Accessories|     Eyewear|           189005|Villach|   NA|      Austria|     F| 37|       Married|Professional|
|Personal Accessories|     Eyewear|           107865|Villach|   NA|      Austria|     F| 37|       Married|Professional|
+--------------------+------------+-----------------+-------+-----+-------------+------+---+--------------+------------+
only showing top 5 rows




-- execution #2


dashdb.show(5)

+--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
|        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|        CITY|STATE|       COUNTRY|GENDER|AGE|MARITAL_STATUS| PROFESSION|
+--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
|Mountaineering Eq...|       Tools|           112835|  Portsmouth|   NA|United Kingdom|     M| 24|        Single|      Other|
|   Camping Equipment|Cooking Gear|           193902|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
|   Camping Equipment|       Packs|           112837|Jacksonville|   FL| United States|     F| 22|        Single|Hospitality|
|Mountaineering Eq...|        Rope|           193904|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
|      Golf Equipment|     Putters|           112839|Jacksonville|   FL| United States|     F| 31|       Married|      Other|
+--------------------+------------+-----------------+------------+-----+--------------+------+---+--------------+-----------+
only showing top 5 rows





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-20793) cache table will not refresh after insert data to some broadcast table

2017-05-18 Thread du (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

du closed SPARK-20793.
--
Resolution: Not A Problem

> cache table will not refresh after insert data to some broadcast table
> --
>
> Key: SPARK-20793
> URL: https://issues.apache.org/jira/browse/SPARK-20793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: du
>
> Run the SQL below in spark-sql or beeline:
> create table t4(c1 int,c2 int);
> insert into table t4 select 1,2;
> insert into table t4 select 2,2;
> create table t5(c1 int,c2 int);
> insert into table t5 select 2,3;
> cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 
> from t4 join t5 on t4.c2=t5.c1;
> cache table t6 as select * from t4 join t3 on t4.c2=t3.c2;
> select * from t3;
> select * from t6;
> insert into table t5 select 2,4;
> select * from t3;
> select * from t6;
> After inserting into table t5, t3 and t6 do not include the row (2,4).
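
For the record, since {{CACHE TABLE ... AS SELECT}} materializes a snapshot, one way to pick up the new row is to rebuild the cached views after the insert. A PySpark sketch, assuming a SparkSession named {{spark}} and the tables from the repro above:

```python
# Rebuild the cached join results after inserting into t5
# (t3/t6 are cached snapshots, so they do not see the new row by themselves).
spark.sql("UNCACHE TABLE t6")
spark.sql("UNCACHE TABLE t3")
spark.catalog.dropTempView("t6")
spark.catalog.dropTempView("t3")
spark.sql("""
    CACHE TABLE t3 AS
    SELECT t4.c1 AS c1, t4.c2 AS c2, t5.c1 AS c3, t5.c2 AS c4
    FROM t4 JOIN t5 ON t4.c2 = t5.c1
""")
spark.sql("CACHE TABLE t6 AS SELECT * FROM t4 JOIN t3 ON t4.c2 = t3.c2")
```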



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20793) cache table will not refresh after insert data to some broadcast table

2017-05-18 Thread du (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

du updated SPARK-20793:
---
Description: 
Run the SQL below in spark-sql or beeline:

create table t4(c1 int,c2 int);
insert into table t4 select 1,2;
insert into table t4 select 2,2;
create table t5(c1 int,c2 int);
insert into table t5 select 2,3;

cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 from 
t4 join t5 on t4.c2=t5.c1;
cache table t6 as select * from t4 join t3 on t4.c2=t3.c2;
select * from t3;
select * from t6;
insert into table t5 select 2,4;

select * from t3;
select * from t6;

After inserting into table t5, t3 and t6 do not include the row (2,4).

  was:
create table t4(c1 int,c2 int);
insert into table t4 select 1,2;
insert into table t4 select 2,2;
create table t5(c1 int,c2 int);
insert into table t5 select 2,3;
run below sql
cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 from 
t4 join t5 on t4.c2=t5.c1;
cache table t6 as select * from t4 join t3 on t4.c2=t3.c2;
select * from t3;
select * from t6;
insert into table t5 select 2,4;

select * from t3;
select * from t6;

after insert table t5, t3 and t6 are not include data 2,4


> cache table will not refresh after insert data to some broadcast table
> --
>
> Key: SPARK-20793
> URL: https://issues.apache.org/jira/browse/SPARK-20793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: du
>
> Run the SQL below in spark-sql or beeline:
> create table t4(c1 int,c2 int);
> insert into table t4 select 1,2;
> insert into table t4 select 2,2;
> create table t5(c1 int,c2 int);
> insert into table t5 select 2,3;
> cache table t3 as select t4.c1 as c1,t4.c2 as c2,t5.c1 as c3, t5.c2 as c4 
> from t4 join t5 on t4.c2=t5.c1;
> cache table t6 as select * from t4 join t3 on t4.c2=t3.c2;
> select * from t3;
> select * from t6;
> insert into table t5 select 2,4;
> select * from t3;
> select * from t6;
> After inserting into table t5, t3 and t6 do not include the row (2,4).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org