[jira] [Assigned] (SPARK-18726) Filesystem unnecessarily scanned twice during creation of non-catalog table

2017-03-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18726:
---

Assignee: Song Jun

> Filesystem unnecessarily scanned twice during creation of non-catalog table
> ---
>
> Key: SPARK-18726
> URL: https://issues.apache.org/jira/browse/SPARK-18726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> It seems that for non-catalog tables (e.g. spark.read.parquet(...)), we scan 
> the filesystem twice, once for schema inference, and another to create a 
> FileIndex class for the relation.
> It would be better to combine these scans somehow, since this is the most 
> costly step of creating a table. This is a follow-up ticket to 
> https://github.com/apache/spark/pull/16090.
> cc [~cloud_fan]
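A user-side workaround until the scans are combined (a sketch only, not the SPARK-18726 fix itself): supplying the schema explicitly skips the inference pass, so the filesystem is listed only once for the FileIndex. The column names and path below are hypothetical, and {{spark}} is assumed to be an active SparkSession.
{code}
import org.apache.spark.sql.types._

// Hypothetical schema and path, purely for illustration.
val schema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

// With an explicit schema there is no inference scan; only the listing
// needed to build the relation's FileIndex remains.
val df = spark.read.schema(schema).parquet("/data/events")
{code}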






[jira] [Resolved] (SPARK-18726) Filesystem unnecessarily scanned twice during creation of non-catalog table

2017-03-02 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18726.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17081
[https://github.com/apache/spark/pull/17081]

> Filesystem unnecessarily scanned twice during creation of non-catalog table
> ---
>
> Key: SPARK-18726
> URL: https://issues.apache.org/jira/browse/SPARK-18726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
> Fix For: 2.2.0
>
>
> It seems that for non-catalog tables (e.g. spark.read.parquet(...)), we scan 
> the filesystem twice, once for schema inference, and another to create a 
> FileIndex class for the relation.
> It would be better to combine these scans somehow, since this is the most 
> costly step of creating a table. This is a follow-up ticket to 
> https://github.com/apache/spark/pull/16090.
> cc [~cloud_fan]






[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors

2017-03-02 Thread Paul Lysak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893877#comment-15893877
 ] 

Paul Lysak commented on SPARK-19371:


I'm observing similar behavior in Spark 2.1. Unfortunately, due to the complex 
workflow of our application, I have not yet been able to identify after which 
operation exactly all the partitions of the DataFrame end up on a single 
executor, so no matter how big the cluster is, only one executor picks up all 
the work.
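(A rough diagnostic sketch for anyone hitting the same thing, assuming {{df}} is the persisted DataFrame in question: each task reports the executor it ran on, which roughly reflects where the cached partitions live because of locality preferences.)
{code}
import org.apache.spark.SparkEnv

// Count how many partitions' tasks ran on each executor.
val placement = df.rdd
  .mapPartitionsWithIndex { (_, _) => Iterator((SparkEnv.get.executorId, 1)) }
  .reduceByKey(_ + _)
  .collect()

placement.foreach { case (exec, n) => println(s"executor $exec -> $n partitions") }
{code}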

> Cannot spread cached partitions evenly across executors
> ---
>
> Key: SPARK-19371
> URL: https://issues.apache.org/jira/browse/SPARK-19371
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Thunder Stumpges
>
> Before running an intensive iterative job (in this case a distributed topic 
> model training), we need to load a dataset and persist it across executors. 
> After loading from HDFS and persisting, the partitions are spread unevenly 
> across executors (based on the initial scheduling of the reads, which are not 
> data-locality sensitive). The partition sizes are even, just not their 
> distribution over executors. We currently have no way to force the partitions 
> to spread evenly, and as the iterative algorithm begins, tasks are 
> distributed to executors based on this initial load, forcing some very 
> unbalanced work.
> This has been mentioned a 
> [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059]
>  of 
> [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html]
>  in 
> [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html]
>  user/dev group threads.
> None of the discussions I could find had solutions that worked for me. Here 
> are examples of things I have tried. All resulted in partitions in memory 
> that were NOT evenly distributed to executors, causing future tasks to be 
> imbalanced across executors as well.
> *Reduce Locality*
> {code}spark.shuffle.reduceLocality.enabled=false/true{code}
> *"Legacy" memory mode*
> {code}spark.memory.useLegacyMode = true/false{code}
> *Basic load and repartition*
> {code}
> val numPartitions = 48*16
> val df = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions).
> persist
> df.count
> {code}
> *Load and repartition to 2x partitions, then shuffle repartition down to 
> desired partitions*
> {code}
> val numPartitions = 48*16
> val df2 = sqlContext.read.
> parquet("/data/folder_to_load").
> repartition(numPartitions*2)
> val df = df2.repartition(numPartitions).
> persist
> df.count
> {code}
> It would be great if, when persisting an RDD/DataFrame, we could request 
> that those partitions be stored evenly across executors in preparation for 
> future tasks. 
> I'm not sure whether this is a more general issue (i.e. not just involving 
> persisting RDDs), but for the persisted in-memory case, it can make a HUGE 
> difference in the overall running time of the remaining work.






[jira] [Commented] (SPARK-19339) StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next on empty iterator

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893858#comment-15893858
 ] 

Nick Pentreath commented on SPARK-19339:


This should be addressed by SPARK-19573 - empty (or all-null) columns will 
return an empty Array rather than throw an exception.
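For reference, a minimal reproduction through the public {{approxQuantile}} API (a sketch; assumes a SparkSession named {{spark}}):
{code}
// Empty DataFrame with a double column named "x".
val emptyDf = spark.range(10).limit(0).selectExpr("cast(id as double) as x")

// Before the fix this path hit "NoSuchElementException: next on empty iterator";
// with SPARK-19573 it should return an empty Array instead.
val quartiles = emptyDf.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.0)
{code}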

> StatFunctions.multipleApproxQuantiles can give NoSuchElementException: next 
> on empty iterator
> -
>
> Key: SPARK-19339
> URL: https://issues.apache.org/jira/browse/SPARK-19339
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Barry Becker
>Priority: Minor
>
> This problem is easy to reproduce by running 
> StatFunctions.multipleApproxQuantiles on an empty dataset, but I think it can 
> occur in other cases, like if the column is all null or all one value.
> I have unit tests that can hit it in several different cases.
> The fix that I have introduced locally is to return
> {code}
>  if (sampled.length == 0) 0 else sampled.last.value
> {code}
> instead of 
> {code}
> sampled.last.value
> {code}
> at the end of QuantileSummaries.query.
> Below is the exception:
> {code}
> next on empty iterator
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.last(TraversableLike.scala:459)
>   at 
> scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$last(ArrayOps.scala:186)
>   at 
> scala.collection.IndexedSeqOptimized$class.last(IndexedSeqOptimized.scala:132)
>   at scala.collection.mutable.ArrayOps$ofRef.last(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.QuantileSummaries.query(QuantileSummaries.scala:207)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply$mcDD$sp(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1$$anonfun$apply$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator$$anonfun$multipleApproxQuantiles$1.apply(SparkPercentileCalculator.scala:91)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at 
> org.apache.spark.sql.SparkPercentileCalculator.multipleApproxQuantiles(SparkPercentileCalculator.scala:91)
>   at 
> com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles$lzycompute(ContinuousMinesetStats.scala:274)
>   at 
> com.mineset.spark.statistics.model.ContinuousMinesetStats.quartiles(ContinuousMinesetStats.scala:272)
>   at 
> com.mineset.spark.statistics.model.MinesetStats.com$mineset$spark$statistics$model$MinesetStats$$serializeContinuousFeature$1(MinesetStats.scala:66)
>   at 
> com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:118)
>   at 
> com.mineset.spark.statistics.model.MinesetStats$$anonfun$calculateWithColumns$1.apply(MinesetStats.scala:114)
>   at 

[jira] [Commented] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893821#comment-15893821
 ] 

Nick Pentreath commented on SPARK-19714:


If you feel that handling values outside the bucket ranges as "invalid" is 
reasonable - specifically including them in the special "invalid" bucket - then 
we can discuss if and how that could be implemented.

I agree it's quite a large departure, but we could support it with a further 
param value such as "keepAll", which keeps both {{NaN}} and values outside the 
range in the special bucket.

I don't see a compelling reason that this is a bug, so if you want to motivate 
for a change then propose an approach. 

I do think we should update the doc for {{handleInvalid}} - [~wojtek-szymanski] 
feel free to open a PR for that.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---
>
> Key: SPARK-19714
> URL: https://issues.apache.org/jira/browse/SPARK-19714
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Bill Chambers
>
> {code}
> contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect that this would handle the invalid values. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of 
> Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the 
> lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?
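In the meantime, one practical workaround (distinct from the proposed "keepAll" value, which is only under discussion above) is to widen the splits with infinite outer boundaries so that no finite value is out of range; {{handleInvalid}} then only governs {{NaN}}. A sketch, assuming a SparkSession named {{spark}}:
{code}
import org.apache.spark.ml.feature.Bucketizer

val contDF = spark.range(500).selectExpr("cast(id as double) as id")

// -Infinity / +Infinity outer splits give every finite value a bucket,
// so values like 0.0 no longer trigger the "out of Bucketizer bounds" error.
val splits = Array(Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setOutputCol("bucket")
  .setHandleInvalid("skip")

bucketer.transform(contDF).show()
{code}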






[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893811#comment-15893811
 ] 

Nick Pentreath commented on SPARK-19747:


Also agree we should be able to extract out the penalty / regularization term. 
I know Seth's done that in the WIP PR for L2 - but L1 is interesting because 
currently it is more tightly baked into the Breeze optimizers...

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.
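A rough, compilable sketch of those two aims (assumptions only, not the actual design; all names here are hypothetical, and a simplified {{Instance}} stand-in is used because {{ml.feature.Instance}} is private to Spark):
{code}
import scala.reflect.ClassTag
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Simplified stand-in for the (package-private) ml.feature.Instance.
case class Instance(label: Double, weight: Double, features: Vector)

// (1) Shared parent: the seqOp/combOp plumbing lives here; only `add` differs per algorithm.
abstract class LossAggregator[Agg <: LossAggregator[Agg]] extends Serializable { self: Agg =>
  var lossSum = 0.0
  var weightSum = 0.0
  val gradientSum: Array[Double]

  /** Algorithm-specific: fold one instance into the running loss / gradient sums. */
  def add(instance: Instance): Agg

  /** Shared combOp. */
  def merge(other: Agg): Agg = {
    lossSum += other.lossSum
    weightSum += other.weightSum
    var i = 0
    while (i < gradientSum.length) { gradientSum(i) += other.gradientSum(i); i += 1 }
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}

// (2) One generic cost function parameterized by the aggregator type,
// so the RDD aggregation code is written exactly once.
class RDDLossFunction[Agg <: LossAggregator[Agg] : ClassTag](
    instances: RDD[Instance],
    newAggregator: () => Agg) extends Serializable {

  def calculate(): (Double, Array[Double]) = {
    val agg = instances.treeAggregate(newAggregator())(
      (a, x) => a.add(x),
      (a, b) => a.merge(b))
    (agg.loss, agg.gradient)
  }
}
{code}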






[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-03-02 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893810#comment-15893810
 ] 

Nick Pentreath commented on SPARK-19747:


[~yuhaoyan] for {{SGDClassifier}} it would be interesting to look at a Vowpal 
Wabbit-style normalized adaptive gradient descent approach, which does the 
normalization / standardization on the fly during training.

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.






[jira] [Closed] (SPARK-18478) Support codegen for Hive UDFs

2017-03-02 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-18478.

Resolution: Won't Fix

> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.






[jira] [Assigned] (SPARK-19779) structured streaming exist needless tmp file

2017-03-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-19779:


Assignee: Feng Gui

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.3, 2.1.1, 2.2.0
>Reporter: Feng Gui
>Assignee: Feng Gui
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> problem remains: a temporary delta file is still left behind in HDFS. 
> Structured Streaming does not delete the tmp file generated when a streaming 
> job is restarted, so we need to delete the tmp file after the streaming job 
> restarts.






[jira] [Resolved] (SPARK-19779) structured streaming exist needless tmp file

2017-03-02 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19779.
--
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> structured streaming exist needless tmp file 
> -
>
> Key: SPARK-19779
> URL: https://issues.apache.org/jira/browse/SPARK-19779
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.3, 2.1.1, 2.2.0
>Reporter: Feng Gui
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> The PR (https://github.com/apache/spark/pull/17012) fixes restarting a 
> Structured Streaming application that uses HDFS as the file system, but a 
> problem remains: a temporary delta file is still left behind in HDFS. 
> Structured Streaming does not delete the tmp file generated when a streaming 
> job is restarted, so we need to delete the tmp file after the streaming job 
> restarts.






[jira] [Updated] (SPARK-19805) Log the row type when query result does not match

2017-03-02 Thread Genmao Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu updated SPARK-19805:
--
Summary: Log the row type when query result does not match  (was: Log the 
row type when query result does match)

> Log the row type when query result does not match
> -
>
> Key: SPARK-19805
> URL: https://issues.apache.org/jira/browse/SPARK-19805
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893736#comment-15893736
 ] 

Shivaram Venkataraman commented on SPARK-19796:
---

I think (a) is worth exploring in a new JIRA -- we should try to avoid sending 
data that we don't need on the executors during task execution.

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")
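For illustration, the 64KB ceiling comes from the JVM itself rather than from Spark: {{DataOutputStream.writeUTF}} stores the encoded length in an unsigned short. A minimal standalone sketch (no Spark involved):
{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

val out = new DataOutputStream(new ByteArrayOutputStream())
val longStatement = "x" * 70000

// Throws java.io.UTFDataFormatException ("encoded string too long"),
// which is what the TaskDescription serialization runs into.
out.writeUTF(longStatement)
{code}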






[jira] [Assigned] (SPARK-19806) PySpark GLR supports tweedie distribution

2017-03-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19806:
---

Assignee: Yanbo Liang

> PySpark GLR supports tweedie distribution
> -
>
> Key: SPARK-19806
> URL: https://issues.apache.org/jira/browse/SPARK-19806
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> PySpark {{GeneralizedLinearRegression}} supports tweedie distribution.






[jira] [Assigned] (SPARK-19806) PySpark GLR supports tweedie distribution

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19806:


Assignee: (was: Apache Spark)

> PySpark GLR supports tweedie distribution
> -
>
> Key: SPARK-19806
> URL: https://issues.apache.org/jira/browse/SPARK-19806
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> PySpark {{GeneralizedLinearRegression}} supports tweedie distribution.






[jira] [Assigned] (SPARK-19806) PySpark GLR supports tweedie distribution

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19806:


Assignee: Apache Spark

> PySpark GLR supports tweedie distribution
> -
>
> Key: SPARK-19806
> URL: https://issues.apache.org/jira/browse/SPARK-19806
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> PySpark {{GeneralizedLinearRegression}} supports tweedie distribution.






[jira] [Commented] (SPARK-19806) PySpark GLR supports tweedie distribution

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893733#comment-15893733
 ] 

Apache Spark commented on SPARK-19806:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17146

> PySpark GLR supports tweedie distribution
> -
>
> Key: SPARK-19806
> URL: https://issues.apache.org/jira/browse/SPARK-19806
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> PySpark {{GeneralizedLinearRegression}} supports tweedie distribution.






[jira] [Created] (SPARK-19806) PySpark GLR supports tweedie distribution

2017-03-02 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-19806:
---

 Summary: PySpark GLR supports tweedie distribution
 Key: SPARK-19806
 URL: https://issues.apache.org/jira/browse/SPARK-19806
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang
Priority: Minor


PySpark {{GeneralizedLinearRegression}} supports tweedie distribution.
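For context, a sketch of the Scala usage that the PySpark wrapper would mirror (assumes the tweedie family support already present in the Scala {{GeneralizedLinearRegression}}, and a DataFrame {{dataset}} with "label"/"features" columns):
{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Tweedie with variance power 1.5 (between Poisson and Gamma); linkPower 0.0 is the log link.
val glr = new GeneralizedLinearRegression()
  .setFamily("tweedie")
  .setVariancePower(1.5)
  .setLinkPower(0.0)

val model = glr.fit(dataset)
{code}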






[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893713#comment-15893713
 ] 

Hyukjin Kwon commented on SPARK-15474:
--

Let me leave a pointer: 
https://github.com/apache/hive/blob/branch-1.2/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L96-L106

It seems, however, that it no longer writes the empty file on master: 
https://github.com/apache/hive/blob/4a42bec6ba4cb8257dec517bc7c45b6a8f5a9e67/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L116

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls of 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.






[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE

2017-03-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893709#comment-15893709
 ] 

Hyukjin Kwon commented on SPARK-10294:
--

Maybe we could resolve this as a duplicate of SPARK-13127, if that is indeed 
the case, since I see SPARK-18140 was also resolved as a duplicate.

> When Parquet writer's close method throws an exception, we will call close 
> again and trigger a NPE
> --
>
> Key: SPARK-10294
> URL: https://issues.apache.org/jira/browse/SPARK-10294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> When a task saves a large parquet file (larger than the S3 file size limit) 
> to S3, looks like we still call parquet writer's close twice and triggers NPE 
> reported in SPARK-7837. Eventually, job failed and I got NPE as the 
> exception. Actually, the real problem was that the file was too large for S3.
> {code}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:147)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:192)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:190)

[jira] [Commented] (SPARK-10294) When Parquet writer's close method throws an exception, we will call close again and trigger a NPE

2017-03-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893706#comment-15893706
 ] 

Hyukjin Kwon commented on SPARK-10294:
--

Hi [~yhuai], it seems this issue refers to PARQUET-544, which is fixed in 
Parquet 1.9.0.

> When Parquet writer's close method throws an exception, we will call close 
> again and trigger a NPE
> --
>
> Key: SPARK-10294
> URL: https://issues.apache.org/jira/browse/SPARK-10294
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
> Attachments: screenshot-1.png
>
>
> When a task saves a large parquet file (larger than the S3 file size limit) 
> to S3, looks like we still call parquet writer's close twice and triggers NPE 
> reported in SPARK-7837. Eventually, job failed and I got NPE as the 
> exception. Actually, the real problem was that the file was too large for S3.
> {code}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1280)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1268)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1267)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1493)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1455)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1444)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1818)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1831)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1908)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:108)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$Table.genData(Tables.scala:147)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:192)
>   at 
> com.databricks.spark.sql.perf.tpcds.Tables$$anonfun$genData$2.apply(Tables.scala:190)
>   at 

[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893703#comment-15893703
 ] 

Nicholas Chammas commented on SPARK-15474:
--

cc [~owen.omalley]

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls of 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.






[jira] [Assigned] (SPARK-19805) Log the row type when query result does match

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19805:


Assignee: Apache Spark

> Log the row type when query result does match
> -
>
> Key: SPARK-19805
> URL: https://issues.apache.org/jira/browse/SPARK-19805
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-19805) Log the row type when query result does match

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893693#comment-15893693
 ] 

Apache Spark commented on SPARK-19805:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17145

> Log the row type when query result does match
> -
>
> Key: SPARK-19805
> URL: https://issues.apache.org/jira/browse/SPARK-19805
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Assigned] (SPARK-19805) Log the row type when query result does match

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19805:


Assignee: (was: Apache Spark)

> Log the row type when query result does match
> -
>
> Key: SPARK-19805
> URL: https://issues.apache.org/jira/browse/SPARK-19805
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>







[jira] [Created] (SPARK-19805) Log the row type when query result does match

2017-03-02 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19805:
-

 Summary: Log the row type when query result does match
 Key: SPARK-19805
 URL: https://issues.apache.org/jira/browse/SPARK-19805
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu
Priority: Minor









[jira] [Commented] (SPARK-15474) ORC data source fails to write and read back empty dataframe

2017-03-02 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893691#comment-15893691
 ] 

Hyukjin Kwon commented on SPARK-15474:
--

This seems to be an issue related to Hive's {{OrcOutputFormat}}: the record 
writer does not write the footer if no rows are written, but it does write an 
empty file when it closes.
Currently, we create the {{RecordWriter}} lazily on the Spark side in 
{{OrcFileFormat}}, so the empty file is not created when no rows are written.

>  ORC data source fails to write and read back empty dataframe
> -
>
> Key: SPARK-15474
> URL: https://issues.apache.org/jira/browse/SPARK-15474
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently ORC data source fails to write and read empty data.
> The code below:
> {code}
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws an exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc18gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case with the data below
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created (there are no calls of 
> {{WriterContainer.writeRows()}}).
> For Parquet and JSON this works, but for ORC it does not.






[jira] [Resolved] (SPARK-19745) SVCAggregator serializes coefficients

2017-03-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19745.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> SVCAggregator serializes coefficients
> -
>
> Key: SPARK-19745
> URL: https://issues.apache.org/jira/browse/SPARK-19745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.2.0
>
>
> Similar to [SPARK-16008|https://issues.apache.org/jira/browse/SPARK-16008], 
> the SVC aggregator captures the coefficients in the class closure, and 
> therefore ships them around during optimization. We can prevent this with a 
> bit of reorganization of the aggregator class.






[jira] [Assigned] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19803:


Assignee: Apache Spark

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/






[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893640#comment-15893640
 ] 

Apache Spark commented on SPARK-19803:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17144

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sital Kedia
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/






[jira] [Assigned] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19803:


Assignee: (was: Apache Spark)

> Flaky BlockManagerProactiveReplicationSuite tests
> -
>
> Key: SPARK-19803
> URL: https://issues.apache.org/jira/browse/SPARK-19803
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sital Kedia
>
> The tests added for BlockManagerProactiveReplicationSuite have made the 
> Jenkins build flaky. Please refer to the build for more details: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/






[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-03-02 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893596#comment-15893596
 ] 

zhengruifeng commented on SPARK-18608:
--

[~mlnick] [~yuhaoyan] [~srowen] I think using {{train(dataset: Dataset[_], 
handlePersistence: Boolean)}} instead of {{train(dataset: Dataset[_])}} may 
result in extra problems for external implementers, because existing external 
algorithms that override {{Predictor.train}} will no longer work.
I think we can do it in another way:
{code}
abstract class Predictor[
    FeaturesType,
    Learner <: Predictor[FeaturesType, Learner, M],
    M <: PredictionModel[FeaturesType, M]]
  extends Estimator[M] with PredictorParams {

  // Remember the storage level of the input dataset so that train() can see it.
  protected var storageLevel = StorageLevel.NONE

  override def fit(dataset: Dataset[_]): M = {
    storageLevel = dataset.storageLevel
    ...
  }

  protected def train(dataset: Dataset[_]): M
}
{code}

so in algorithm implementations we can use the original storageLevel of the 
input dataset.

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.
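A sketch of the migrated check (assuming the {{Dataset.storageLevel}} API from SPARK-16063; not the exact Spark ML code):
{code}
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Decide whether an algorithm should cache internally by looking at the
// Dataset's own storage level, not at dataset.rdd.getStorageLevel.
def shouldHandlePersistence(dataset: Dataset[_]): Boolean =
  dataset.storageLevel == StorageLevel.NONE
{code}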






[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893584#comment-15893584
 ] 

Mridul Muralidharan commented on SPARK-19796:
-


I would not prefer (b) - if we are worried that users are depending on a 
private property, sending them a truncated version of it only aggravates the 
problem! I would rather fail fast with a missing value.

Having said that, while we should limit our internal usage of properties, this 
mechanism is also used to propagate user-specified key-value pairs, so adding 
limits or log messages might not be optimal. Worst case, if we detect that the 
properties Map is growing really large, we could broadcast it (ugh?).

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19802) Remote History Server

2017-03-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893580#comment-15893580
 ] 

Saisai Shao commented on SPARK-19802:
-

Spark's {{ApplicationHistoryProvider}} is pluggable: users can implement their 
own provider and plug it into Spark's history server. So you could implement the 
{{HistoryProvider}} you want outside of Spark.

From your description, this is more like a Hadoop ATS (Hadoop application 
timeline server). We have an implementation of a Timeline-based history provider 
for Spark's history server. The main feature is what you mentioned: query 
through TCP, get the events, and display them on the UI.

> Remote History Server
> -
>
> Key: SPARK-19802
> URL: https://issues.apache.org/jira/browse/SPARK-19802
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Ben Barnard
>
> Currently the history server expects to find history in a filesystem 
> somewhere. It would be nice to have a history server that listens for 
> application events on a TCP port, and an EventLoggingListener that sends 
> events to the listening history server instead of writing to a file. This 
> would allow the history server to show up-to-date history for past and 
> running jobs in a cluster environment that lacks a shared filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-03-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893579#comment-15893579
 ] 

Marcelo Vanzin commented on SPARK-19804:


For posterity, the error you get looks like this:

{noformat}
java.lang.ExceptionInInitializerError: null
at java.lang.Class.getConstructor0(Class.java:2892)
at java.lang.Class.getDeclaredConstructor(Class.java:2058)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1541)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:67)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3220)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3239)
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3464)
at 
org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:226)
at 
org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:210)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:333)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:294)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:269)
at 
org.apache.spark.sql.hive.client.ClientWrapper.client(ClientWrapper.scala:272)
{noformat}

Which is rather cryptic; it's caused by one of the classes in the constructor 
being loaded by two different class loaders, so {{getDeclaredConstructor}} 
fails to find the right constructor and returns null.

> HiveClientImpl does not work with Hive 2.2.0 metastore
> --
>
> Key: SPARK-19804
> URL: https://issues.apache.org/jira/browse/SPARK-19804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> I know that Spark currently does not officially support Hive 2.2 (perhaps 
> because it hasn't been released yet); but we have some 2.2 patches in CDH and 
> the current code in the isolated client fails. The most probable culprits are 
> the changes added in HIVE-13149.
> The fix is simple, and here's the patch we applied in CDH:
> https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0
> Fixing that doesn't affect any existing Hive version support, but will make 
> it easier to support 2.2 when it's out.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14698) CREATE FUNCTION cloud not add function to hive metastore

2017-03-02 Thread poseidon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893565#comment-15893565
 ] 

poseidon commented on SPARK-14698:
--

[~azeroth2b] 
I think in Spark 1.6.1 the author did this on purpose. If this bug is fixed, the 
function can be stored in the DB, but it cannot be loaded again when the 
thrift server restarts.

But I can upload the patch anyway.

In spark-1.6.1\sql\hive\src\main\scala\org\apache\spark\sql\hive\HiveContext.scala:
{code}
private def functionOrMacroDDLPattern(command: String) = Pattern.compile(
  ".*(create|drop)\\s+(temporary\\s+)(function|macro).+",
  Pattern.DOTALL).matcher(command)
{code}

This is the corrected regular expression that lets a plain CREATE FUNCTION 
command be stored in the DB (only TEMPORARY function/macro DDL is matched and 
routed to Hive execution).
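
A small, self-contained demo of how this pattern classifies the two kinds of 
statements (a sketch, not the actual HiveContext code path):

{code}
// With the mandatory "temporary" group, only TEMPORARY function/macro DDL matches;
// a plain CREATE FUNCTION falls through and can be stored in the metastore.
import java.util.regex.Pattern

object FunctionDdlPatternDemo {
  private def functionOrMacroDDLPattern(command: String) = Pattern.compile(
    ".*(create|drop)\\s+(temporary\\s+)(function|macro).+",
    Pattern.DOTALL).matcher(command)

  def main(args: Array[String]): Unit = {
    println(functionOrMacroDDLPattern("create temporary function f as 'com.x.F'").matches()) // true
    println(functionOrMacroDDLPattern("create function f as 'com.x.F'").matches())           // false
  }
}
{code}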

> CREATE FUNCTION cloud not add function to hive metastore
> 
>
> Key: SPARK-14698
> URL: https://issues.apache.org/jira/browse/SPARK-14698
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: spark1.6.1
>Reporter: poseidon
>  Labels: easyfix
>
> Build Spark 1.6.1 and run it with Hive 1.2.1, with MySQL configured as the 
> metastore server. 
> Start a thrift server, then in beeline try to CREATE FUNCTION as a Hive SQL 
> UDF. 
> The function cannot be added to the MySQL metastore, but the function itself 
> works.
> If you try to add it again, the thrift server throws an "already exists" exception.
> [SPARK-10151][SQL] "Support invocation of hive macro" 
> added an if condition in runSqlHive, which executes CREATE FUNCTION in 
> hiveexec and caused this problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19349) Check resource ready to avoid multiple receivers to be scheduled on the same node.

2017-03-02 Thread Genmao Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu closed SPARK-19349.
-
Resolution: Won't Fix

> Check resource ready to avoid multiple receivers to be scheduled on the same 
> node.
> --
>
> Key: SPARK-19349
> URL: https://issues.apache.org/jira/browse/SPARK-19349
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> Currently, we only ensure that the registered resources satisfy 
> "spark.scheduler.minRegisteredResourcesRatio". But if 
> "spark.scheduler.minRegisteredResourcesRatio" is set too small, receivers may 
> still be scheduled onto only a few nodes. In fact, we could wait one more time 
> for sufficient resources so that receivers are scheduled evenly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19750) Spark UI http -> https redirect error

2017-03-02 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19750.

   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.1.1
   2.0.3

> Spark UI http -> https redirect error
> -
>
> Key: SPARK-19750
> URL: https://issues.apache.org/jira/browse/SPARK-19750
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.3, 2.1.1
>
>
> Spark's HTTP-to-HTTPS redirect uses port 0 as the secure port when the HTTPS 
> port is not set explicitly, which leads to {{ java.net.NoRouteToHostException: 
> Can't assign requested address }}. The fix is to use the actually bound port 
> for the redirect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893508#comment-15893508
 ] 

Yun Ni commented on SPARK-19771:


[~merlin] What you are suggesting is to hash each AND hash vector into a single 
integer, which I don't think makes sense. It does little to improve running 
time, since Spark SQL does a hash join and the number of vector comparisons is 
already close to minimal. It improves the memory cost of each transformed row 
from O(NumHashFunctions*NumHashTables) to O(NumHashTables), but at the cost of 
increasing the false positive rate, especially when NumHashFunctions is large.

From a user experience perspective, hiding the actual hash values from users is 
a bad practice, because users need to run their own algorithms based on the 
hash values. Besides that, we expect users to increase the number of hash 
functions when they want to lower the false positive rate. Hashing the vector 
would increase the false positive rate again, which should not be expected.
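
To make the trade-off concrete, here is a hedged, self-contained sketch (all 
names illustrative, not Spark ML code) comparing the full per-row representation 
with the collapsed one being proposed:

{code}
// Collapsing each AND-hash vector to one integer shrinks storage from
// NumHashFunctions * NumHashTables values per row to NumHashTables values,
// at the cost of extra collisions (a higher false-positive rate).
object CollapsedHashSketch {
  def main(args: Array[String]): Unit = {
    val numHashTables = 5
    val numHashFunctions = 3
    val rng = new scala.util.Random(42)

    // Full representation: one vector of hash values per table.
    val hashes: Array[Array[Int]] =
      Array.fill(numHashTables)(Array.fill(numHashFunctions)(rng.nextInt()))

    // Collapsed representation: one integer per table (content-based hash of the vector).
    val collapsed: Array[Int] = hashes.map(_.toSeq.hashCode)

    println(s"full = ${hashes.map(_.length).sum} values per row, " +
      s"collapsed = ${collapsed.length} values per row")
  }
}
{code}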

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling

2017-03-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19276.

   Resolution: Fixed
 Assignee: Imran Rashid
Fix Version/s: 2.2.0

> FetchFailures can be hidden by user (or sql) exception handling
> ---
>
> Key: SPARK-19276
> URL: https://issues.apache.org/jira/browse/SPARK-19276
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Critical
> Fix For: 2.2.0
>
>
> The scheduler handles node failures by looking for a special 
> {{FetchFailedException}} thrown by the shuffle block fetcher.  This is 
> handled in {{Executor}} and then passed as a special msg back to the driver: 
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403
> However, user code exists in between the shuffle block fetcher and that catch 
> block -- it could intercept the exception, wrap it with something else, and 
> throw a different exception.  If that happens, Spark treats it as an ordinary 
> task failure and retries the task, rather than regenerating the missing 
> shuffle data.  The task eventually is retried 4 times, it is doomed to fail 
> each time, and the job fails.
> You might think that no user code should do that -- but even Spark SQL does it:
> https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214
> Here's an example stack trace.  This is from Spark 1.6, so the sql code is 
> not the same, but the problem is still there:
> {noformat}
> 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while 
> writing rows.
> at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect 
> to xxx/yyy:zzz
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> ...
> 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0 
> failed 4 times; aborting job
> {noformat}
> I think the right fix here is to also set a fetch failure status in the 
> {{TaskContextImpl}}, so the executor can check that instead of just one 
> exception.
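
A hedged, self-contained illustration of the failure mode (plain Scala, not 
Spark code; the exception class below is a stand-in for the real 
FetchFailedException):

{code}
// Wrapping an exception in a generic RuntimeException hides its original type from
// any handler that matches on that type; this is how user/SQL code can hide a
// FetchFailedException from the executor's fetch-failure handling.
object WrappedFailureDemo {
  class FetchFailedException(msg: String) extends Exception(msg) // stand-in, not the Spark class

  def userWriteLoop(): Unit = {
    try {
      throw new FetchFailedException("Failed to connect to xxx/yyy:zzz") // raised by the shuffle fetcher
    } catch {
      case t: Throwable =>
        throw new RuntimeException("Task failed while writing rows.", t) // the original type is now hidden
    }
  }

  def main(args: Array[String]): Unit = {
    try userWriteLoop() catch {
      case _: FetchFailedException => println("would be handled as a fetch failure")
      case e: Throwable            => println(s"treated as an ordinary task failure: ${e.getClass.getSimpleName}")
    }
  }
}
{code}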



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19804) HiveClientImpl does not work with Hive 2.2.0 metastore

2017-03-02 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-19804:
--

 Summary: HiveClientImpl does not work with Hive 2.2.0 metastore
 Key: SPARK-19804
 URL: https://issues.apache.org/jira/browse/SPARK-19804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marcelo Vanzin
Priority: Minor


I know that Spark currently does not officially support Hive 2.2 (perhaps 
because it hasn't been released yet); but we have some 2.2 patches in CDH and 
the current code in the isolated client fails. The most probable culprits are 
the changes added in HIVE-13149.

The fix is simple, and here's the patch we applied in CDH:
https://github.com/cloudera/spark/commit/954f060afe6ed469e85d656abd02790a79ec07a0

Fixing that doesn't affect any existing Hive version support, but will make it 
easier to support 2.2 when it's out.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893376#comment-15893376
 ] 

Kay Ousterhout commented on SPARK-19796:


Do you think we should (separately) fix the underlying problem?  Specifically, 
we could:

(a) not send the SPARK_JOB_DESCRIPTION property to the workers, since it's only 
used on the master for the UI (and while users *could* access it, the variable 
name SPARK_JOB_DESCRIPTION is spark-private, which suggests that it shouldn't 
be used by users).  Perhaps this is too risky because users could be using it?

(b) Truncate SPARK_JOB_DESCRIPTION to something reasonable (100 characters?) 
before sending it to the workers.  This is more backwards compatible if users 
are actually reading the property, but maybe a useless intermediate approach?

(c) (Possibly in addition to one of the above) Log a warning if any of the 
properties is longer than 100 characters (or some threshold).

Thoughts?  I can file a JIRA if you think any of these is worthwhile.
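
As a hedged sketch of option (c) (illustrative names and threshold, not existing 
Spark code), the check could look like this:

{code}
// Warn when any property value exceeds a threshold, since every task serializes them.
import java.util.Properties
import scala.collection.JavaConverters._

object LargePropertyCheck {
  def warnOnLargeProperties(props: Properties, maxLen: Int = 100): Unit = {
    props.asScala.foreach { case (key, value) =>
      if (value != null && value.length > maxLen) {
        println(s"WARN: property '$key' is ${value.length} characters long; " +
          "it will be serialized with every task")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("spark.job.description", "insert into table test values " + ("(1)," * 50000))
    warnOnLargeProperties(props)
  }
}
{code}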

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks

2017-03-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout reassigned SPARK-19631:
--

Assignee: Patrick Woody

> OutputCommitCoordinator should not allow commits for already failed tasks
> -
>
> Key: SPARK-19631
> URL: https://issues.apache.org/jira/browse/SPARK-19631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Patrick Woody
>Assignee: Patrick Woody
> Fix For: 2.2.0
>
>
> This is similar to SPARK-6614, but there is a race condition where a task may 
> fail (e.g. Executor heartbeat timeout) and still manage to go through the 
> commit protocol successfully. After this any retries of the task will fail 
> indefinitely because of TaskCommitDenied.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks

2017-03-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19631.

   Resolution: Fixed
Fix Version/s: 2.2.0

> OutputCommitCoordinator should not allow commits for already failed tasks
> -
>
> Key: SPARK-19631
> URL: https://issues.apache.org/jira/browse/SPARK-19631
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Patrick Woody
> Fix For: 2.2.0
>
>
> This is similar to SPARK-6614, but there is a race condition where a task may 
> fail (e.g. Executor heartbeat timeout) and still manage to go through the 
> commit protocol successfully. After this any retries of the task will fail 
> indefinitely because of TaskCommitDenied.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-03-02 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893222#comment-15893222
 ] 

Andrew Ash commented on SPARK-18113:


We discovered another bug related to committing that causes a task dead loop; 
work to fix it is being done in SPARK-19631.

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> The executor's *AskPermissionToCommitOutput* message to the driver fails, so it 
> retries the send. The driver receives 2 AskPermissionToCommitOutput messages and 
> handles both, but the executor ignores the first response (true) and receives the 
> second response (false). The TaskAttemptNumber for this partition in 
> authorizedCommittersByStage is then locked forever, and the driver enters an 
> infinite loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 
> seconds]
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> at 
> 

[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893117#comment-15893117
 ] 

Mingjie Tang commented on SPARK-19771:
--

(1) Because you need to explode each tuple. In the example mentioned above, for 
one input tuple you have to build 3 rows, and each hash value contains a vector 
whose length is the number of hash functions; thus, for one tuple, the memory 
overhead is NumHashFunctions*NumHashTables = 15. If the number of input tuples 
is N, the overhead is NumHashFunctions*NumHashTables*N. 

(2) Yes, the hash value can be anything, depending on your input bucket width W. 
Actually, it should be very large to reduce collisions.

(3) I am not sure hashCode can work, because we need to use this function 
for multi-probe searching.  

> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893097#comment-15893097
 ] 

Yun Ni edited comment on SPARK-19771 at 3/2/17 9:55 PM:


[~merlin] 
(1) The computation cost is NumHashFunctions because we go through each index 
only once. I don't know what's N in the memory overhead?
(2) The hash values are not necessarily 0, 1, -1.
(3) If we really want a hash function of Vector, why not use Vector.hashCode?



was (Author: yunn):
[~merlin] 
(1) The computation cost is NumHashFunctions because we go through each index 
only once. I don't know what's N in the memory overhead?
(2) The hash values are not necessarily {0, 1, -1}.
(3) If we really want a hash function of Vector, why not use Vector.hashCode?


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19771) Support OR-AND amplification in Locality Sensitive Hashing (LSH)

2017-03-02 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893097#comment-15893097
 ] 

Yun Ni commented on SPARK-19771:


[~merlin] 
(1) The computation cost is NumHashFunctions because we go through each index 
only once. I don't know what's N in the memory overhead?
(2) The hash values are not necessarily {0, 1, -1}.
(3) If we really want a hash function of Vector, why not use Vector.hashCode?


> Support OR-AND amplification in Locality Sensitive Hashing (LSH)
> 
>
> Key: SPARK-19771
> URL: https://issues.apache.org/jira/browse/SPARK-19771
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yun Ni
>
> The current LSH implementation only supports AND-OR amplification. We need to 
> discuss the following questions before we go to implementation:
> (1) Whether we should support OR-AND amplification
> (2) What API changes we need for OR-AND amplification
> (3) How we fix the approxNearestNeighbor and approxSimilarityJoin internally.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-03-02 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893087#comment-15893087
 ] 

Mingjie Tang commented on SPARK-18454:
--

[~yunn] The current multi-probe NNS can be improved without building an index. 

 



> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0

2017-03-02 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893074#comment-15893074
 ] 

Andrew Otto commented on SPARK-1693:


We just upgraded to CDH 5.10, which has Spark 1.6.0, Hadoop 2.6.0, Hive 1.1.0, 
and Oozie 4.1.0.

We are having trouble running Spark jobs that use HiveContext from Oozie.  They 
run perfectly fine from the CLI with spark-submit, just not in Oozie.  We 
aren't certain that HiveContext is related, but we can reproduce regularly with 
a job that uses HiveContext.

Anyway, I post this here, because the error we are getting is the same that 
started this issue:

{code}class "javax.servlet.FilterRegistration"'s signer information does not 
match signer information of other classes in the same package{code}

I've noticed that the Oozie sharelib includes 
javax.servlet-3.0.0.v201112011016.jar.  I also see that spark-assembly.jar 
includes a javax.servlet.FilterRegistration class, although it's hard for me to 
tell which version.  The jetty pom.xml files in spark-assembly.jar seem to say 
{{javax.servlet.*;version="2.6.0"}}, but I'm a little green on how all these 
dependencies get resolved.  I don't see any javax.servlet .jars in any of 
/usr/lib/hadoop* (where CDH installs hadoop jars).

Help!  :)  If this is not related to this issue, I'll open a new one.


> Dependent on multiple versions of servlet-api jars lead to throw an 
> SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 
> 
>
> Key: SPARK-1693
> URL: https://issues.apache.org/jira/browse/SPARK-1693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: log.txt
>
>
> {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > 
> log.txt{code}
> The log: 
> {code}
> UnpersistSuite:
> - unpersist RDD *** FAILED ***
>   java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s 
> signer information does not match signer information of other classes in the 
> same package
>   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0

2017-03-02 Thread Andrew Otto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893074#comment-15893074
 ] 

Andrew Otto edited comment on SPARK-1693 at 3/2/17 9:42 PM:


We just upgraded to CDH 5.10, which has Spark 1.6.0, Hadoop 2.6.0, Hive 1.1.0, 
and Oozie 4.1.0.

We are having trouble running Spark jobs that use HiveContext from Oozie.  They 
run perfectly fine from the CLI with spark-submit, just not in Oozie.  We 
aren't certain that HiveContext is related, but we can reproduce regularly with 
a job that uses HiveContext.

Anyway, I post this here, because the error we are getting is the same that 
started this issue:

{code}class "javax.servlet.FilterRegistration"'s signer information does not 
match signer information of other classes in the same package{code}

I've noticed that the Oozie sharelib includes 
javax.servlet-3.0.0.v201112011016.jar.  I also see that spark-assembly.jar 
includes a javax.servlet.FilterRegistration class, although it's hard for me to 
tell which version.  The jetty pom.xml files in spark-assembly.jar seem to say 
{{javax.servlet.\*;version="2.6.0"}}, but I'm a little green on how all these 
dependencies get resolved.  I don't see any javax.servlet .jars in any of 
/usr/lib/hadoop* (where CDH installs hadoop jars).

Help!  :)  If this is not related to this issue, I'll open a new one.



was (Author: ottomata):
We just upgraded to CDH 5.10, which has Spark 1.6.0, Hadoop 2.6.0, Hive 1.1.0, 
and Oozie 4.1.0.

We are having trouble running Spark jobs that use HiveContext from Oozie.  They 
run perfectly fine from the CLI with spark-submit, just not in Oozie.  We 
aren't certain that HiveContext is related, but we can reproduce regularly with 
a job that uses HiveContext.

Anyway, I post this here, because the error we are getting is the same that 
started this issue:

{code}class "javax.servlet.FilterRegistration"'s signer information does not 
match signer information of other classes in the same package{code}

I've noticed that the Oozie sharelib includes 
javax.servlet-3.0.0.v201112011016.jar.  I also see that spark-assembly.jar 
includes a javax.servlet.FilterRegistration class, although its hard for me to 
tell which version.  The jetty pom.xml files in spark-assembly.jar seem to say 
{{javax.servlet.*;version="2.6.0"}}, but I'm a little green on how all these 
dependencies get resolved.  I don't see any javax.servlet .jars in any of 
/usr/lib/hadoop* (where CDH installs hadoop jars).

Help!  :)  If this is not related to this issue, I'll open a new one.


> Dependent on multiple versions of servlet-api jars lead to throw an 
> SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 
> 
>
> Key: SPARK-1693
> URL: https://issues.apache.org/jira/browse/SPARK-1693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: log.txt
>
>
> {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > 
> log.txt{code}
> The log: 
> {code}
> UnpersistSuite:
> - unpersist RDD *** FAILED ***
>   java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s 
> signer information does not match signer information of other classes in the 
> same package
>   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests

2017-03-02 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-19803:
---

 Summary: Flaky BlockManagerProactiveReplicationSuite tests
 Key: SPARK-19803
 URL: https://issues.apache.org/jira/browse/SPARK-19803
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Sital Kedia


The tests added for BlockManagerProactiveReplicationSuite has made the jenkins 
build flaky. Please refer to the build for more details - 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19802) Remote History Server

2017-03-02 Thread Ben Barnard (JIRA)
Ben Barnard created SPARK-19802:
---

 Summary: Remote History Server
 Key: SPARK-19802
 URL: https://issues.apache.org/jira/browse/SPARK-19802
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Ben Barnard


Currently the history server expects to find history in a filesystem somewhere. 
It would be nice to have a history server that listens for application events 
on a TCP port, and an EventLoggingListener that sends events to the 
listening history server instead of writing to a file. This would allow the 
history server to show up-to-date history for past and running jobs in a 
cluster environment that lacks a shared filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19801) Remove JDK7 from Travis CI

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892817#comment-15892817
 ] 

Apache Spark commented on SPARK-19801:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17143

> Remove JDK7 from Travis CI
> --
>
> Key: SPARK-19801
> URL: https://issues.apache.org/jira/browse/SPARK-19801
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR 
> verification (JDK7/JDK8 maven compilation and Java Linter) and contributors 
> can see the additional result via their Travis CI dashboard (or PC).
> This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was 
> removed via SPARK-19550.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19801) Remove JDK7 from Travis CI

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19801:


Assignee: Apache Spark

> Remove JDK7 from Travis CI
> --
>
> Key: SPARK-19801
> URL: https://issues.apache.org/jira/browse/SPARK-19801
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR 
> verification (JDK7/JDK8 maven compilation and Java Linter) and contributors 
> can see the additional result via their Travis CI dashboard (or PC).
> This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was 
> removed via SPARK-19550.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19801) Remove JDK7 from Travis CI

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19801:


Assignee: (was: Apache Spark)

> Remove JDK7 from Travis CI
> --
>
> Key: SPARK-19801
> URL: https://issues.apache.org/jira/browse/SPARK-19801
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR 
> verification (JDK7/JDK8 maven compilation and Java Linter) and contributors 
> can see the additional result via their Travis CI dashboard (or PC).
> This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was 
> removed via SPARK-19550.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19801) Remove JDK7 from Travis CI

2017-03-02 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19801:
-

 Summary: Remove JDK7 from Travis CI
 Key: SPARK-19801
 URL: https://issues.apache.org/jira/browse/SPARK-19801
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.1.0
Reporter: Dongjoon Hyun
Priority: Minor


Since Spark 2.1.0, Travis CI was supported by SPARK-15207 for automated PR 
verification (JDK7/JDK8 maven compilation and Java Linter) and contributors can 
see the additional result via their Travis CI dashboard (or PC).

This issue aims to make `.travis.yml` up-to-date by removing JDK7 which was 
removed via SPARK-19550.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19720) Redact sensitive information from SparkSubmit console output

2017-03-02 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19720.

   Resolution: Fixed
 Assignee: Mark Grover
Fix Version/s: 2.2.0

> Redact sensitive information from SparkSubmit console output
> 
>
> Key: SPARK-19720
> URL: https://issues.apache.org/jira/browse/SPARK-19720
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.0
>Reporter: Mark Grover
>Assignee: Mark Grover
> Fix For: 2.2.0
>
>
> SPARK-18535 took care of redacting sensitive information from Spark event 
> logs and UI. However, it intentionally didn't bother redacting the same 
> sensitive information from SparkSubmit's console output because it was on the 
> client's machine, which already had the sensitive information on disk (in 
> spark-defaults.conf) or on terminal (spark-submit command line).
> However, it seems now that it's better to redact information from 
> SparkSubmit's console output as well because orchestration software like 
> Oozie usually expose SparkSubmit's console output via a UI. To make matters 
> worse, Oozie, in particular, always sets the {{--verbose}} flag on 
> SparkSubmit invocation, making the sensitive information readily available in 
> its UI (see 
> [code|https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org/apache/oozie/action/hadoop/SparkMain.java#L248]
>  here).
> This is a JIRA for tracking redaction of sensitive information from 
> SparkSubmit's console output.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11197) Run SQL query on files directly without create a table

2017-03-02 Thread Ladislav Jech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892706#comment-15892706
 ] 

Ladislav Jech commented on SPARK-11197:
---

Great stuff!

> Run SQL query on files directly without create a table
> --
>
> Key: SPARK-11197
> URL: https://issues.apache.org/jira/browse/SPARK-11197
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> It's useful to run SQL query directly on files without creating a table, as 
> people done with Apache Drill. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18699) Spark CSV parsing types other than String throws exception when malformed

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892679#comment-15892679
 ] 

Apache Spark commented on SPARK-18699:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17142

> Spark CSV parsing types other than String throws exception when malformed
> -
>
> Key: SPARK-18699
> URL: https://issues.apache.org/jira/browse/SPARK-18699
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Jakub Nowacki
>Assignee: Takeshi Yamamuro
> Fix For: 2.2.0
>
>
> If a CSV is read and the schema contains any type other than String, an exception 
> is thrown when the string value in the CSV is malformed; e.g. if a timestamp 
> does not match the defined format, an exception is thrown:
> {code}
> Caused by: java.lang.IllegalArgumentException
>   at java.sql.Date.valueOf(Date.java:143)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply$mcJ$sp(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$$anonfun$castTo$6.apply(CSVInferSchema.scala:272)
>   at scala.util.Try.getOrElse(Try.scala:79)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:85)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:128)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anonfun$buildReader$1$$anonfun$apply$2.apply(CSVFileFormat.scala:127)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
>   ... 8 more
> {code}
> It behaves similarly with Integer and Long types, from what I've seen.
> To my understanding modes PERMISSIVE and DROPMALFORMED should just null the 
> value or drop the line, but instead they kill the job.
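
For context, a hedged usage sketch of the expected behaviour (the path and 
schema below are illustrative assumptions, not taken from the report):

{code}
// With mode=PERMISSIVE a malformed value should become null (DROPMALFORMED should
// drop the row) instead of failing the job.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object PermissiveCsvDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-permissive-demo").master("local[*]").getOrCreate()
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("ts", TimestampType)))

    val df = spark.read
      .option("header", "true")
      .option("mode", "PERMISSIVE") // or "DROPMALFORMED" to skip bad lines
      .schema(schema)
      .csv("/path/to/data.csv")     // hypothetical path

    df.show()
    spark.stop()
  }
}
{code}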



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19800:


Assignee: (was: Apache Spark)

> Implement one kind of streaming sampling - reservoir sampling
> -
>
> Key: SPARK-19800
> URL: https://issues.apache.org/jira/browse/SPARK-19800
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892571#comment-15892571
 ] 

Apache Spark commented on SPARK-19800:
--

User 'uncleGen' has created a pull request for this issue:
https://github.com/apache/spark/pull/17141

> Implement one kind of streaming sampling - reservoir sampling
> -
>
> Key: SPARK-19800
> URL: https://issues.apache.org/jira/browse/SPARK-19800
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19800:


Assignee: Apache Spark

> Implement one kind of streaming sampling - reservoir sampling
> -
>
> Key: SPARK-19800
> URL: https://issues.apache.org/jira/browse/SPARK-19800
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19800) Implement one kind of streaming sampling - reservoir sampling

2017-03-02 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19800:
-

 Summary: Implement one kind of streaming sampling - reservoir 
sampling
 Key: SPARK-19800
 URL: https://issues.apache.org/jira/browse/SPARK-19800
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892547#comment-15892547
 ] 

Imran Rashid commented on SPARK-19796:
--

[~kayousterhout] [~shivaram] here's another example of serializing lots of 
pointless data in each task -- in this case, {{TaskDescription.properties}} 
contains lots of data which the executors don't care about, and this gets 
serialized once per task.

For this jira, I'll just do a small fix, but I thought you might be interested 
in this.

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19766) INNER JOIN on constant alias columns return incorrect results

2017-03-02 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-19766:

Fix Version/s: 2.0.3

> INNER JOIN on constant alias columns return incorrect results
> -
>
> Key: SPARK-19766
> URL: https://issues.apache.org/jira/browse/SPARK-19766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
>Priority: Critical
>  Labels: Correctness
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> We can demonstrate the problem with the following data set and query:
> {code}
> val spark = 
> SparkSession.builder().appName("test").master("local").getOrCreate()
> val sql1 =
>   """
> |create temporary view t1 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql2 =
>   """
> |create temporary view t2 as select * from values
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql3 =
>   """
> |create temporary view t3 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sql4 =
>   """
> |create temporary view t4 as select * from values
> |(1),
> |(1)
> |as grouping(a)
>   """.stripMargin
> val sqlA =
>   """
> |create temporary view ta as
> |select a, 'a' as tag from t1 union all
> |select a, 'b' as tag from t2
>   """.stripMargin
> val sqlB =
>   """
> |create temporary view tb as
> |select a, 'a' as tag from t3 union all
> |select a, 'b' as tag from t4
>   """.stripMargin
> val sql =
>   """
> |select tb.* from ta inner join tb on
> |ta.a = tb.a and
> |ta.tag = tb.tag
>   """.stripMargin
> spark.sql(sql1)
> spark.sql(sql2)
> spark.sql(sql3)
> spark.sql(sql4)
> spark.sql(sqlA)
> spark.sql(sqlB)
> spark.sql(sql).show()
> {code}
> The results which is incorrect:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> |  1|  a|
> |  1|  a|
> +---+---+
> {code}
> The correct results should be:
> {code}
> +---+---+
> |  a|tag|
> +---+---+
> |  1|  a|
> |  1|  a|
> |  1|  b|
> |  1|  b|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19796:


Assignee: Apache Spark

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Assignee: Apache Spark
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19796:


Assignee: (was: Apache Spark)

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892482#comment-15892482
 ] 

Apache Spark commented on SPARK-19796:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/17140

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19799) Support WITH clause in subqueries

2017-03-02 Thread Giambattista (JIRA)
Giambattista created SPARK-19799:


 Summary: Support WITH clause in subqueries
 Key: SPARK-19799
 URL: https://issues.apache.org/jira/browse/SPARK-19799
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Giambattista


Because of SPARK-17590 it should be relatively easy to support the WITH clause 
in subqueries, besides nested CTE definitions.

Here is an example of a query that does not run on Spark:
create table test (seqno int, k string, v int) using parquet;
insert into TABLE test values (1,'a', 99),(2, 'b', 88),(3, 'a', 77),(4, 'b', 
66),(5, 'c', 55),(6, 'a', 44),(7, 'b', 33);
SELECT percentile(b, 0.5) FROM (WITH mavg AS (SELECT k, AVG(v) OVER (PARTITION 
BY k ORDER BY seqno ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) as b FROM test 
ORDER BY seqno) SELECT k, MAX(b) as b  FROM mavg GROUP BY k);



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892390#comment-15892390
 ] 

Imran Rashid commented on SPARK-19796:
--

Since it's a regression, I'm making this a blocker for 2.2.0 (or else we revert 
SPARK-17931, but the fix should be simple).

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-02 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-19796:
-
Priority: Blocker  (was: Major)

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892363#comment-15892363
 ] 

Apache Spark commented on SPARK-18890:
--

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/17139

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> 
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]
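
As a generic illustration of the idea only (plain Scala, not Spark's scheduler code;
all names are made up): serializing on a thread pool rather than a single thread lets
per-task serialization overlap.
{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

case class FakeTask(id: Int, payload: Array[Byte])  // stand-in for a task to serialize

def serialize(t: FakeTask): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(t)
  oos.close()
  bos.toByteArray
}

// Serialize many tasks concurrently on a small pool instead of one thread.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

val tasks = (1 to 1000).map(i => FakeTask(i, Array.fill(4096)(0: Byte)))
val serialized = Await.result(Future.traverse(tasks)(t => Future(serialize(t))), Duration.Inf)
{code}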



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17080) join reorder

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892325#comment-15892325
 ] 

Apache Spark commented on SPARK-17080:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17138

> join reorder
> 
>
> Key: SPARK-17080
> URL: https://issues.apache.org/jira/browse/SPARK-17080
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We decide the join order of a multi-way join query based on the cost function 
> defined in the spec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17080) join reorder

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17080:


Assignee: Apache Spark

> join reorder
> 
>
> Key: SPARK-17080
> URL: https://issues.apache.org/jira/browse/SPARK-17080
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Apache Spark
>
> We decide the join order of a multi-way join query based on the cost function 
> defined in the spec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17080) join reorder

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17080:


Assignee: (was: Apache Spark)

> join reorder
> 
>
> Key: SPARK-17080
> URL: https://issues.apache.org/jira/browse/SPARK-17080
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We decide the join order of a multi-way join query based on the cost function 
> defined in the spec.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2017-03-02 Thread Giambattista (JIRA)
Giambattista created SPARK-19798:


 Summary: Query returns stale results when tables are modified on 
other sessions
 Key: SPARK-19798
 URL: https://issues.apache.org/jira/browse/SPARK-19798
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Giambattista


I observed the problem on the master branch with the thrift server in 
multisession mode (the default), but I was able to replicate it with spark-shell 
as well (see the sequence below).
I observed cases where changes made in a session (table insert, table renaming) 
are not visible to other derived sessions (created with session.newSession).

The problem seems to be due to the fact that each session has its own 
tableRelationCache and it does not get refreshed.
IMO tableRelationCache should be shared in sharedState, maybe in the 
cacheManager, so that refreshing caches for data that is not session-specific, 
such as temporary tables, gets centralized.

--- Spark shell script

val spark2 = spark.newSession
spark.sql("CREATE TABLE test (a int) using parquet")
spark2.sql("select * from test").show // OK returns empty
spark.sql("select * from test").show // OK returns empty
spark.sql("insert into TABLE test values 1,2,3")
spark2.sql("select * from test").show // ERROR returns empty
spark.sql("select * from test").show // OK returns 3,2,1
spark.sql("create table test2 (a int) using parquet")
spark.sql("insert into TABLE test2 values 4,5,6")
spark2.sql("select * from test2").show // OK returns 6,4,5
spark.sql("select * from test2").show // OK returns 6,4,5
spark.sql("alter table test rename to test3")
spark.sql("alter table test2 rename to test")
spark.sql("alter table test3 rename to test2")
spark2.sql("select * from test").show // ERROR returns empty
spark.sql("select * from test").show // OK returns 6,4,5
spark2.sql("select * from test2").show // ERROR throws 
java.io.FileNotFoundException
spark.sql("select * from test2").show // OK returns 3,1,2





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18769) Spark to be smarter about what the upper bound is and to restrict number of executor when dynamic allocation is enabled

2017-03-02 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892318#comment-15892318
 ] 

Thomas Graves commented on SPARK-18769:
---

I definitely understand there is an actual problem here, but I think the 
problem is more with Spark and its event processing/synchronization than the 
fact that we are asking for more containers. As I mentioned, I agree with doing 
the JIRA; I just want to clarify why we are doing it and make sure we do it such 
that it doesn't hurt our container allocation. It's always good to play nice in 
the YARN environment and not ask for more containers than the entire cluster 
can handle, for instance, but at the same time, even if we limit the container 
requests early on, YARN could easily free up resources and make them available 
for you. If you don't have your request in, YARN could give those resources to 
someone else. There are a lot of configs in the YARN schedulers and many 
different situations. If you look at some other apps on YARN (MR and TEZ), both 
immediately ask for all of their resources. MR is definitely different since it 
doesn't reuse containers; TEZ does. When asking for everything immediately you 
can definitely hit issues where, if your tasks run really fast, you don't need 
all of those containers, but the exponential ramp-up in our allocation now gets 
you there really quickly anyway, so I think you can hit the same issue.

Note that in our clusters we set the upper limit by default to something 
reasonable (a couple thousand), and if someone has a really large job they can 
reconfigure.
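
For reference, a minimal sketch of capping executors under dynamic allocation (the
values are illustrative, not a recommendation):
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("capped-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")        // external shuffle service is required on YARN
  .config("spark.dynamicAllocation.maxExecutors", "2000") // "a couple thousand" as above
  .getOrCreate()
{code}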
 

>  Spark to be smarter about what the upper bound is and to restrict number of 
> executor when dynamic allocation is enabled
> 
>
> Key: SPARK-18769
> URL: https://issues.apache.org/jira/browse/SPARK-18769
> Project: Spark
>  Issue Type: New Feature
>Reporter: Neerja Khattar
>
> Currently when dynamic allocation is enabled, max.executor is infinite and 
> Spark creates so many executors that it can even exceed the YARN NodeManager 
> memory limit and vcores.
> It should have a check so as not to exceed the YARN resource limit.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19345.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Add doc for "coldStartStrategy" usage in ALS
> 
>
> Key: SPARK-19345
> URL: https://issues.apache.org/jira/browse/SPARK-19345
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
> Fix For: 2.2.0
>
>
> SPARK-14489 adds the ability to skip {{NaN}} predictions during 
> {{ALS.transform}}. This can be useful in production scenarios but is 
> particularly useful when trying to use the cross-validation classes with ALS, 
> since in many cases the test set will have users/items that are not in the 
> training set, leading to evaluation metrics that are all {{NaN}} and making 
> cross-validation unusable.
> Add an explanation for the {{coldStartStrategy}} param to the ALS 
> documentation, and add example code to illustrate the usage.
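
A short sketch of the usage the documentation change will cover (column names and the
input data are assumptions):
{code}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop") // drop rows with NaN predictions for unseen users/items
// val model = als.fit(training)
// model.transform(test) can then be scored by an evaluator without all-NaN metrics
{code}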



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19345) Add doc for "coldStartStrategy" usage in ALS

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-19345:
---
Priority: Minor  (was: Major)

> Add doc for "coldStartStrategy" usage in ALS
> 
>
> Key: SPARK-19345
> URL: https://issues.apache.org/jira/browse/SPARK-19345
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK-14489 adds the ability to skip {{NaN}} predictions during 
> {{ALS.transform}}. This can be useful in production scenarios but is 
> particularly useful when trying to use the cross-validation classes with ALS, 
> since in many cases the test set will have users/items that are not in the 
> training set, leading to evaluation metrics that are all {{NaN}} and making 
> cross-validation unusable.
> Add an explanation for the {{coldStartStrategy}} param to the ALS 
> documentation, and add example code to illustrate the usage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892216#comment-15892216
 ] 

Sean Owen commented on SPARK-19797:
---

Yes, though it's not true of scoring, and the difference in fitting won't 
matter to the caller.

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Zhe Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892189#comment-15892189
 ] 

Zhe Sun edited comment on SPARK-19797 at 3/2/17 12:52 PM:
--

Hi Sean, thanks for your quick reply. 

bq. If the Pipeline had more stages, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.

Let's use IDF as an example. If the pipeline is like:
bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression
When we fit this pipeline, *IDF* will first call _fit_, then call _transform_ 
and pass the IDF result to LogisticRegression, because LogisticRegression is an 
Estimator and its _fit_ needs the data from _transform_ of *IDF*.

However, if the last stage of pipeline is Normalizer 
(https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
bq. Tokenizer -> HashingTF -> IDF -> Normalizer 
When fitting this pipeline, *IDF* will only call _fit_ and does not need to call 
_transform_.

That's why I think it is better to modify the description as below to make it 
accurate.
bq. If the Pipeline had more Estimators, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.



was (Author: ymwdalex):
Hi Sean, thanks for your quick reply. 

bq. If the Pipeline had more stages, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.

Let's use IDF as an example. If the pipeline is like:
bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression
When we fit this pipeline, *IDF* will first call _fit_, then call _transform_ 
and pass the idf result to LogisticRegression. Because LogisticRegression is an 
Estimator and _fit_ of LogisticRegression needs the data from _transformer_ of 
*IDF*.

However, if the last stage of pipeline is Normalizer 
(https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
bq. Tokenizer -> HashingTF -> IDF -> Normalizer 
When fitting this pipeline, *IDF* will only call _fit_, and do not need to call 
_transform_

That's why I think it is better to correct the description as 
bq. If the Pipeline had more Estimators, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.


> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Zhe Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892189#comment-15892189
 ] 

Zhe Sun commented on SPARK-19797:
-

Hi Sean, thanks for your quick reply. 

bq. If the Pipeline had more stages, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.

Let's use IDF as an example. If the pipeline is like:
bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression
When we fit this pipeline, *IDF* will first call _fit_, then call _transform_ 
and pass the IDF result to LogisticRegression, because LogisticRegression is an 
Estimator and its _fit_ needs the data from _transform_ of *IDF*.

However, if the last stage of pipeline is Normalizer 
(https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
bq. Tokenizer -> HashingTF -> IDF -> Normalizer 
When fitting this pipeline, *IDF* will only call _fit_ and does not need to call 
_transform_.

That's why I think it is better to correct the description as 
bq. If the Pipeline had more Estimators, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.
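
For reference, a small pipeline along the lines discussed above (a sketch; column names
and the training data are assumptions, not the doc's own example):
{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Tokenizer and HashingTF are Transformers; IDF and LogisticRegression are Estimators.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val stages: Array[PipelineStage] = Array(tokenizer, hashingTF, idf, lr)
val pipeline = new Pipeline().setStages(stages)

// During pipeline.fit(trainingDF), IDF.fit is followed by IDFModel.transform only
// because a later Estimator (LogisticRegression) still needs the transformed data;
// if the pipeline ended at IDF, that extra transform would be skipped.
// val model = pipeline.fit(trainingDF) // a trainingDF with "text" and "label" columns is assumed
{code}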


> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19797:


Assignee: Apache Spark

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19797:


Assignee: (was: Apache Spark)

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

2017-03-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892169#comment-15892169
 ] 

Takeshi Yamamuro commented on SPARK-19503:
--

I'm not sure this should be fixed, though; PostgreSQL leaves this kind of 
sorting as it is:
{code}
postgres=# \d testTable
    Table "public.testTable"
 Column |  Type   | Modifiers 
--------+---------+-----------
 key    | integer | 
 value  | integer | 

postgres=# select count(*) from (select * from testTable order by key) t;
 count 
-------
     1
(1 row)

postgres=# explain select count(*) from (select * from testTable order by key) t;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Aggregate  (cost=192.41..192.42 rows=1 width=0)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=4)
         Sort Key: testTable.key
         ->  Seq Scan on testTable  (cost=0.00..32.60 rows=2260 width=4)
(4 rows)
{code}
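
For comparison, a sketch of the equivalent check on the Spark side ({{spark}} is assumed
to be a SparkSession, e.g. in spark-shell, and the column name is made up):
{code}
// The physical plan should still show a Sort below the aggregate, mirroring the
// PostgreSQL plan above.
val df = spark.range(0, 1000).toDF("key")
df.sort("key").groupBy().count().explain()
{code}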

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
>Reporter: R
>Priority: Minor
>  Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs shuffle and sort and then count! This is wasteful as sort is not 
> required here and makes me wonder how smart the algebraic optimiser is 
> indeed! The data may be partitioned by known count (such as parquet files) 
> and we should not shuffle to just perform count.
> This may look trivial, but if optimiser fails to recognise this, I wonder 
> what else is it missing especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19503) Execution Plan Optimizer: avoid sort or shuffle when it does not change end result such as df.sort(...).count()

2017-03-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892169#comment-15892169
 ] 

Takeshi Yamamuro edited comment on SPARK-19503 at 3/2/17 12:39 PM:
---

I'm not sure this should be fixed, though; PostgreSQL leaves this kind of 
sorting as it is...
{code}
postgres=# \d testTable
    Table "public.testTable"
 Column |  Type   | Modifiers 
--------+---------+-----------
 key    | integer | 
 value  | integer | 

postgres=# select count(*) from (select * from testTable order by key) t;
 count 
-------
     1
(1 row)

postgres=# explain select count(*) from (select * from testTable order by key) t;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Aggregate  (cost=192.41..192.42 rows=1 width=0)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=4)
         Sort Key: testTable.key
         ->  Seq Scan on testTable  (cost=0.00..32.60 rows=2260 width=4)
(4 rows)
{code}


was (Author: maropu):
I'm not sure this should be fixed though, postgresql leaves this kind of 
sorting as it is;
{code}
postgres=# \d testTable
    Table "public.testTable"
 Column |  Type   | Modifiers 
--------+---------+-----------
 key    | integer | 
 value  | integer | 

postgres=# select count(*) from (select * from testTable order by key) t;
 count 
-------
     1
(1 row)

postgres=# explain select count(*) from (select * from testTable order by key) t;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Aggregate  (cost=192.41..192.42 rows=1 width=0)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=4)
         Sort Key: testTable.key
         ->  Seq Scan on testTable  (cost=0.00..32.60 rows=2260 width=4)
(4 rows)
{code}

> Execution Plan Optimizer: avoid sort or shuffle when it does not change end 
> result such as df.sort(...).count()
> ---
>
> Key: SPARK-19503
> URL: https://issues.apache.org/jira/browse/SPARK-19503
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
> Environment: Perhaps only a pyspark or databricks AWS issue
>Reporter: R
>Priority: Minor
>  Labels: execution, optimizer, plan, query
>
> df.sort(...).count()
> performs shuffle and sort and then count! This is wasteful as sort is not 
> required here and makes me wonder how smart the algebraic optimiser is 
> indeed! The data may be partitioned by known count (such as parquet files) 
> and we should not shuffle to just perform count.
> This may look trivial, but if optimiser fails to recognise this, I wonder 
> what else is it missing especially in more complex operations.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Zhe Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892170#comment-15892170
 ] 

Zhe Sun commented on SPARK-19797:
-

A pull request was created: https://github.com/apache/spark/pull/17137

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892168#comment-15892168
 ] 

Apache Spark commented on SPARK-19797:
--

User 'ymwdalex' has created a pull request for this issue:
https://github.com/apache/spark/pull/17137

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892157#comment-15892157
 ] 

Sean Owen commented on SPARK-19797:
---

Hm, on second look, the placement of the sentence suggests it applies to 
fitting. It is a bit of an implementation detail that this is optimized away, 
and the user won't actually care whether the pointless transforms happen during 
fitting or not. It is probably OK as is, but it might be clearer to say 
something like, "has more stages that require the output of the 
LogisticRegressionModel to fit"?

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19778) alais cannot use in group by

2017-03-02 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19778.
--
Resolution: Duplicate

I am resolving this as a duplicate of SPARK-14471.

Please reopen this if I misunderstood.

> alais cannot use in group by
> 
>
> Key: SPARK-19778
> URL: https://issues.apache.org/jira/browse/SPARK-19778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: xukun
>
> Spark does not support “select key as key1 from src group by key1”.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892149#comment-15892149
 ] 

Sean Owen commented on SPARK-19797:
---

I don't think that's true. The resulting pipeline would contain a 
LogisticRegressionModel, and when invoked, its transform() method would be 
called, and the result passed to subsequent transformers if any. You are 
pointing out that the trailing transformations aren't necessary to compute when 
_fitting_ the pipeline. That's what the if statement here is optimizing away.

> ML pipelines document error
> ---
>
> Key: SPARK-19797
> URL: https://issues.apache.org/jira/browse/SPARK-19797
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zhe Sun
>Priority: Trivial
>  Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect 
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the 
> LogisticRegressionModel’s transform() method on the DataFrame before passing 
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage. 
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more 
> _Estimators_*. 
> The code to prove it is here 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19783) Treat shorter/longer lengths of tokens as malformed records in CSV parser

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19783:


Assignee: Apache Spark

> Treat shorter/longer lengths of tokens as malformed records in CSV parser
> -
>
> Key: SPARK-19783
> URL: https://issues.apache.org/jira/browse/SPARK-19783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>
> If the number of tokens in a record does not match the expected length of 
> the schema, we probably need to treat it as a malformed record.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239
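
For context, a sketch of how the existing parse modes surface such rows (file path,
schema, and file contents are assumptions; {{spark}} is a SparkSession):
{code}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assume /tmp/example.csv contains rows like "1,x" (too few tokens) or
// "1,x,2,extra" (too many) relative to the three-column schema below.
val schema = StructType(Seq(
  StructField("a", IntegerType),
  StructField("b", StringType),
  StructField("c", IntegerType)))

val df = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED") // alternatives: "PERMISSIVE" (default), "FAILFAST"
  .csv("/tmp/example.csv")
{code}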



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19783) Treat shorter/longer lengths of tokens as malformed records in CSV parser

2017-03-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19783:


Assignee: (was: Apache Spark)

> Treat shorter/longer lengths of tokens as malformed records in CSV parser
> -
>
> Key: SPARK-19783
> URL: https://issues.apache.org/jira/browse/SPARK-19783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> If the number of tokens in a record does not match the expected length of 
> the schema, we probably need to treat it as a malformed record.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19783) Treat shorter/longer lengths of tokens as malformed records in CSV parser

2017-03-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892146#comment-15892146
 ] 

Apache Spark commented on SPARK-19783:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17136

> Treat shorter/longer lengths of tokens as malformed records in CSV parser
> -
>
> Key: SPARK-19783
> URL: https://issues.apache.org/jira/browse/SPARK-19783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> If the number of tokens in a record does not match the expected length of 
> the schema, we probably need to treat it as a malformed record.
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L239



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19797) ML pipelines document error

2017-03-02 Thread Zhe Sun (JIRA)
Zhe Sun created SPARK-19797:
---

 Summary: ML pipelines document error
 Key: SPARK-19797
 URL: https://issues.apache.org/jira/browse/SPARK-19797
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Zhe Sun
Priority: Trivial


Description about pipeline in this paragraph is incorrect 
https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which 
misleads the user
bq. If the Pipeline had more *stages*, it would call the 
LogisticRegressionModel’s transform() method on the DataFrame before passing 
the DataFrame to the next stage.

The description is not accurate, because *Transformer* could also be a stage. 
But only another Estimator will invoke an extra transform call.

So, the description should be corrected as: *If the Pipeline had more 
_Estimators_*. 

The code to prove it is here 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-19704:
---
Fix Version/s: 2.2.0

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.2.0
>
>
> AFTSurvivalRegression should support numeric censorCol



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19704:
--

Assignee: zhengruifeng

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19704.

Resolution: Fixed

> AFTSurvivalRegression should support numeric censorCol
> --
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath reassigned SPARK-19733:
--

Assignee: Vasilis Vryniotis

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
>Assignee: Vasilis Vryniotis
> Fix For: 2.2.0
>
>
> ALS performs unnecessary casting of the user and item ids (to double). I 
> believe this is because the protected checkedCast() method requires a double 
> input. This can be avoided by refactoring the code of the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-03-02 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath resolved SPARK-19733.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17059
[https://github.com/apache/spark/pull/17059]

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
> Fix For: 2.2.0
>
>
> ALS performs unnecessary casting of the user and item ids (to double). I 
> believe this is because the protected checkedCast() method requires a double 
> input. This can be avoided by refactoring the code of the checkedCast method.
> Issue resolved by pull-request 17059:
> https://github.com/apache/spark/pull/17059



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


