[jira] [Updated] (SPARK-13591) Remove Back-ticks in Attribute/Alias Names

2016-02-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-13591:

Description: When calling .sql, back-ticks are automatically added. When the 
.sql output is used as AttributeReference/Alias names, we hit a couple of 
issues when converting a logical plan to SQL. For example, the name could be 
converted to {{`sum(`name`)`}}, which the parser is unable to recognize.  (was: 
When calling .sql, back-ticks are automatically added. When using .sql as 
AttributeReference/Alias names, we hit a couple of issues when doing 
logicalPlan to SQL. The name could be converted to {{`sum(`name`)`}}. Parser is 
unable to recognize it.)

> Remove Back-ticks in Attribute/Alias Names
> --
>
> Key: SPARK-13591
> URL: https://issues.apache.org/jira/browse/SPARK-13591
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When calling .sql, back-ticks are automatically added. When the .sql output is 
> used as AttributeReference/Alias names, we hit a couple of issues when 
> converting a logical plan to SQL. For example, the name could be converted to 
> {{`sum(`name`)`}}, which the parser is unable to recognize.






[jira] [Created] (SPARK-13591) Remove Back-ticks in Attribute/Alias Names

2016-02-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-13591:
---

 Summary: Remove Back-ticks in Attribute/Alias Names
 Key: SPARK-13591
 URL: https://issues.apache.org/jira/browse/SPARK-13591
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


When calling .sql, back-ticks are automatically added. When the .sql output is 
used as AttributeReference/Alias names, we hit a couple of issues when 
converting a logical plan to SQL. The name could be converted to 
{{`sum(`name`)`}}, which the parser is unable to recognize.
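
To make the failure mode concrete, here is a minimal, hypothetical Scala sketch 
(not Spark's actual quoting code) of how wrapping a name that already came out 
of .sql in back-ticks produces the nested, unparseable identifier described 
above:

{code}
// Hypothetical sketch, not the Catalyst implementation: naively wrapping a
// name in back-ticks without checking whether it already contains them.
def quote(name: String): String = s"`$name`"

// Suppose an expression's .sql output is used directly as an alias name:
val aliasName = "sum(`name`)"
// Quoting it again yields `sum(`name`)`, which the SQL parser cannot
// read back as a single identifier.
println(quote(aliasName))
{code}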






[jira] [Resolved] (SPARK-13550) Add java example for ml.clustering.BisectingKMeans

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13550.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11428
[https://github.com/apache/spark/pull/11428]

> Add java example for ml.clustering.BisectingKMeans
> --
>
> Key: SPARK-13550
> URL: https://issues.apache.org/jira/browse/SPARK-13550
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> Add java example for ml.clustering.BisectingKMeans






[jira] [Updated] (SPARK-13550) Add java example for ml.clustering.BisectingKMeans

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13550:
--
Assignee: zhengruifeng

> Add java example for ml.clustering.BisectingKMeans
> --
>
> Key: SPARK-13550
> URL: https://issues.apache.org/jira/browse/SPARK-13550
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
>
> Add java example for ml.clustering.BisectingKMeans






[jira] [Commented] (SPARK-13581) LibSVM throws MatchError

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173383#comment-15173383
 ] 

Jeff Zhang commented on SPARK-13581:


I suspect it is an issue in code generation, because the root cause is that it 
should read the features column but actually reads the label column, which 
causes the match error. Also, df.show() succeeds without any selection. The 
stack trace shows the error comes from the code generator. Could anyone 
familiar with code generation help with this?

{code}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost 
task 0.0 in stage 5.0 (TID 5, localhost): scala.MatchError: 0.0 (of class 
java.lang.Double)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:63)
at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:60)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:40)
at 
org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$5$$anon$1.hasNext(WholeStageCodegen.scala:305)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
{code}
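
For illustration only, here is a toy Scala sketch (not the actual VectorUDT 
source) of why handing a Double to a serializer whose match only covers 
vector-like values ends in scala.MatchError, which is the failure mode 
described above:

{code}
// Toy model of the failure, consistent with the stack trace above: the match
// only covers vector-like inputs, so a raw Double (a label value) falls
// through and throws scala.MatchError.
sealed trait Vec
case class DenseVec(values: Array[Double]) extends Vec
case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

def serialize(obj: Any): Vec = obj match {
  case v: Vec => v
  // no case for Double, so serialize(0.0) throws
  // scala.MatchError: 0.0 (of class java.lang.Double)
}

serialize(0.0)
{code}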

> LibSVM throws MatchError
> 
>
> Key: SPARK-13581
> URL: https://issues.apache.org/jira/browse/SPARK-13581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jakob Odersky
>Assignee: Jeff Zhang
>Priority: Minor
>
> When running an action on a DataFrame obtained by reading from a libsvm file 
> a MatchError is thrown, however doing the same on a cached DataFrame works 
> fine.
> {code}
> val df = 
> sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
> //file is in spark repository
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
> {code}
> This issue first appeared in commit {{1dac964c1}}, in PR 
> [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.
> [~jeffzhang], do you have any insight of what could be going on?
> cc [~iyounus]






[jira] [Created] (SPARK-13590) Document the behavior of spark.ml logistic regression when there are constant features

2016-02-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-13590:
-

 Summary: Document the behavior of spark.ml logistic regression 
when there are constant features
 Key: SPARK-13590
 URL: https://issues.apache.org/jira/browse/SPARK-13590
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


As discussed in SPARK-13029, we decided to keep the current behavior that sets 
all coefficients associated with constant feature columns to zero, regardless 
of intercept, regularization, and standardization settings. This is the same 
behavior as in glmnet. Since this is different from LIBSVM, we should document 
the behavior correctly, add tests, and generate warning messages if there are 
constant columns and `addIntercept` is false.

cc [~coderxiang] [~dbtsai]
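
As a rough illustration only (not the eventual spark.ml change), the proposed 
warning could be driven by per-feature standard deviations; the featureStd 
array and the addIntercept flag below are assumed inputs:

{code}
// Hedged sketch: flag constant feature columns (zero standard deviation) and
// warn when addIntercept is false, matching the behavior documented above.
def warnOnConstantFeatures(featureStd: Array[Double], addIntercept: Boolean): Unit = {
  val constantCols = featureStd.zipWithIndex.collect { case (std, i) if std == 0.0 => i }
  if (constantCols.nonEmpty && !addIntercept) {
    println(s"WARNING: feature columns ${constantCols.mkString(", ")} are constant; " +
      "their coefficients will be set to zero regardless of regularization settings.")
  }
}
{code}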






[jira] [Resolved] (SPARK-13029) Logistic regression returns inaccurate results when there is a column with identical value, and fit_intercept=false

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13029.
---
Resolution: Won't Fix

As discussed on the PR page, we decided to keep the current behavior, which is 
the same as glmnet. I'll create a PR for the documentation.

> Logistic regression returns inaccurate results when there is a column with 
> identical value, and fit_intercept=false
> ---
>
> Key: SPARK-13029
> URL: https://issues.apache.org/jira/browse/SPARK-13029
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
>
> This is a bug that appears while fitting a Logistic Regression model with 
> `.setStandardization(false)` and `.setFitIntercept(false)`. If the data matrix 
> has a column with identical values, the resulting model is not correct. 
> Specifically, that column will always get a weight of 0, due to a special 
> check inside the code. However, the correct solution, which is unique for L2 
> logistic regression, usually has a non-zero weight.
> I used the heart_scale data 
> (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html) and 
> manually augmented the data matrix with a column of ones (available in the 
> PR). The resulting data is run with reg=1.0, max_iter=1000, tol=1e-9 on the 
> following tools:
>  - libsvm
>  - scikit-learn
>  - sparkml
> (Notice libsvm and scikit-learn use a slightly different formulation, so 
> their regularizer is equivalently set to 1/270).
> The first two will have an objective value 0.7275 and give a solution vector:
> [0.03007516959304916, 0.09054186091216457, 0.09540306114820495, 
> 0.02436266296315414, 0.01739437315700921, -0.0006404006623321454
> 0.06367837291956932, -0.0589096636263823, 0.1382458934368336, 
> 0.06653302996539669, 0.07988499067852513, 0.1197789052423401, 
> 0.1801661775839843, -0.01248615347419409].
> Spark will produce an objective value 0.7278 and give a solution vector:
> [0.029917351003921247,0.08993936770232434,0.09458507615360119,0.024920710363734895,0.018259589234194296,5.929247527202199E-4,0.06362198973221662,-0.059307008587031494,0.13886738997128056,0.0678246717525043,0.08062880450385658,0.12084979858539521,0.180460850026883,0.0]
> Notice the last element of the weight vector is 0.
> An even simpler example is:
> {code:title=benchmark.py|borderStyle=solid}
> import numpy as np
> from sklearn.datasets import load_svmlight_file
> from sklearn.linear_model import LogisticRegression
> x_train = np.array([[1, 1], [0, 1]])
> y_train = np.array([1, 0])
> model = LogisticRegression(tol=1e-9, C=0.5, max_iter=1000, 
> fit_intercept=False).fit(x_train, y_train)
> print model.coef_
> [[ 0.22478867 -0.02241016]]
> {code}
> The same data trained by the current solver also gives a different result, 
> see the unit test in the PR.
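
For reference, here is a hedged Spark-side sketch of the same toy example, 
written against the Spark 1.6 spark.ml API; the regParam value is only a rough 
counterpart of scikit-learn's C=0.5, since the formulations differ as noted 
above:

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors

// Two points whose second feature is constant (always 1.0).
val train = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(1.0, 1.0)),
  (0.0, Vectors.dense(0.0, 1.0))
)).toDF("label", "features")

val lr = new LogisticRegression()
  .setFitIntercept(false)
  .setStandardization(false)
  .setRegParam(1.0)
  .setMaxIter(1000)
  .setTol(1e-9)

// Per this report, the coefficient of the constant second feature comes back as 0.0.
println(lr.fit(train).coefficients)
{code}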






[jira] [Updated] (SPARK-13551) Fix fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13551:
--
Target Version/s: 2.0.0

> Fix fix wrong comment and remove meanless lines in 
> mllib.JavaBisectingKMeansExample
> ---
>
> Key: SPARK-13551
> URL: https://issues.apache.org/jira/browse/SPARK-13551
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> this description is wrong:
> /**
>  * Java example for graph clustering using power iteration clustering (PIC).
>  */
> this for loop is meaningless:
> for (Vector center: model.clusterCenters()) {
>   System.out.println("");
> }






[jira] [Updated] (SPARK-13551) Fix fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13551:
--
Assignee: zhengruifeng

> Fix fix wrong comment and remove meanless lines in 
> mllib.JavaBisectingKMeansExample
> ---
>
> Key: SPARK-13551
> URL: https://issues.apache.org/jira/browse/SPARK-13551
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> this description is wrong:
> /**
>  * Java example for graph clustering using power iteration clustering (PIC).
>  */
> this for loop is meaningless:
> for (Vector center: model.clusterCenters()) {
>   System.out.println("");
> }






[jira] [Resolved] (SPARK-13551) Fix fix wrong comment and remove meanless lines in mllib.JavaBisectingKMeansExample

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13551.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11429
[https://github.com/apache/spark/pull/11429]

> Fix fix wrong comment and remove meanless lines in 
> mllib.JavaBisectingKMeansExample
> ---
>
> Key: SPARK-13551
> URL: https://issues.apache.org/jira/browse/SPARK-13551
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: zhengruifeng
>Priority: Trivial
> Fix For: 2.0.0
>
>
> this description is wrong:
> /**
>  * Java example for graph clustering using power iteration clustering (PIC).
>  */
> this for loop is meaningless:
> for (Vector center: model.clusterCenters()) {
>   System.out.println("");
> }






[jira] [Commented] (SPARK-13546) GBT with many trees consistently giving java.lang.StackOverflowError

2016-02-29 Thread Glen Maisey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173261#comment-15173261
 ] 

Glen Maisey commented on SPARK-13546:
-

As a workaround I've turned checkpointing on. It seems to truncate the DAG so 
that it doesn't crash.

{code}

// first line we add a checkpoint directory
sc.setCheckpointDir("mydir")

...
// we add CheckpointInterval and cache node ids
val gbt = new 
GBTRegressor().setMaxIter(1000).setStepSize(0.001).setMaxDepth(3).setMaxMemoryInMB(256).setMinInstancesPerNode(5).setSubsamplingRate(0.8).setMaxBins(10)
   .setCheckpointInterval(50).setCacheNodeIds(true)

...
{code}

It feels like checkpointing might need to be compulsory for very iterative 
algorithms like GBT.

> GBT with many trees consistently giving java.lang.StackOverflowError
> 
>
> Key: SPARK-13546
> URL: https://issues.apache.org/jira/browse/SPARK-13546
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Glen Maisey
> Attachments: iris.csv
>
>
> I've found that creating a GBT in SparkML with a large number of trees causes 
> a java.lang.StackOverflowError. The same code works with a smaller number of 
> trees. This was occurring on a large dataset; however, I've reproduced the 
> error using the tiny Fisher's Iris dataset to make it a bit easier to debug 
> (https://en.wikipedia.org/wiki/Iris_flower_data_set).
> Unfortunately I do not have the skills to fix the issue, just reproduce it. 
> Let me know if there are things I could test. Thanks!
> Edit: I've already tried increasing the JVM stack size significantly and that 
> doesn't solve the problem.
> {code}
> // Import
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.evaluation.RegressionEvaluator
> import org.apache.spark.ml.feature.RFormula
> import org.apache.spark.ml.regression.GBTRegressor
> // Load Data & specify model
> val data = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load("/user/glenm/iris.csv")
> val formulaString = "SepalLength ~ SepalWidth + PetalLength + PetalWidth + 
> Species"
> // Setup R Formula
> val formula = new RFormula().setFormula(formulaString)
> 
> // Setup evaluation metric
> val evaluator = new 
> RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse")
>  
> val pipeline = new Pipeline().setStages(Array(formula))
> // Transform the data 
> val pipelineModel = pipeline.fit(data)
> val transformedData = pipelineModel.transform(data).select("features","label")
> // Split our data into test and train
> val Array(train, test) = transformedData.randomSplit(Array(0.9, 0.1))
> // Cache our training data
> val trainingData = train.repartition(2).cache()
> trainingData.count()
> // Setup GBT
> val gbt = new 
> GBTRegressor().setMaxIter(1000).setStepSize(0.001).setMaxDepth(3).setMaxMemoryInMB(256).setMinInstancesPerNode(5).setSubsamplingRate(0.8).setMaxBins(10)
> // Create the GBT   
> val myGBM = gbt.fit(trainingData)
> {code}
> Stack trace:
> java.lang.StackOverflowError
>   at 
> java.io.ObjectStreamClass.getPrimFieldValues(ObjectStreamClass.java:1233)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1532)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at 
> scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:137)
>   at 
> scala.collection.mutable.HashMap$$anonfun$writeObject$1.apply(HashMap.scala:135)
>   at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at 
> scala.collection.mutable.HashTable$class.serializeTo(HashTable.scala:124)
>   at scala.collection.mutable.HashMap.serializeTo(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.writeObject(HashMap.scala:135)
>   at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutput

[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0

2016-02-29 Thread Jay Panicker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173254#comment-15173254
 ] 

Jay Panicker commented on SPARK-13117:
--


Sorry, [~devaraj.k], I did not check the full thread - I ran into this issue 
and fixed it the way I mentioned.

Now, digging a bit deeper:
{code}
  protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName)
{code}
and localHostName comes from:
{code}
  protected val localHostName = Utils.localHostNameForURI()
{code}
and localHostNameForURI is defined as:
{code}
  def localHostNameForURI(): String = {
customHostname.getOrElse(InetAddresses.toUriString(localIpAddress))
  }
{code}
ok.. 
{code}
  def setCustomHostname(hostname: String) {
// DEBUG code
Utils.checkHost(hostname)
customHostname = Some(hostname)
  }
{code}
and
{code}
  private lazy val localIpAddress: InetAddress = findLocalInetAddress()
{code}
And findLocalInetAddress does this
{code}
val defaultIpOverride = System.getenv("SPARK_LOCAL_IP")
{code}
and, among other things, also changes the loopback IP to the real IP...

While using "0.0.0.0" instead of localHostName certainly fixes the 
loopback/localhost issue, it bypasses the sequence of steps above, so it might 
have other side effects?
 


> WebUI should use the local ip not 0.0.0.0
> -
>
> Key: SPARK-13117
> URL: https://issues.apache.org/jira/browse/SPARK-13117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jeremiah Jordan
>Priority: Minor
>
> When SPARK_LOCAL_IP is set everything seems to correctly bind and use that IP 
> except the WebUI.  The WebUI should use the SPARK_LOCAL_IP not always use 
> 0.0.0.0
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137






[jira] [Updated] (SPARK-13588) Unable to map Parquet file to Hive Table using HiveContext

2016-02-29 Thread Akshat Thakar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshat Thakar updated SPARK-13588:
--
Summary: Unable to map Parquet file to Hive Table using HiveContext  (was: 
Unable to Map Parquet file to Hive Table using HiveContext)

> Unable to map Parquet file to Hive Table using HiveContext
> --
>
> Key: SPARK-13588
> URL: https://issues.apache.org/jira/browse/SPARK-13588
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.1
> Environment: Linux Red Hat 6.7, Hortonworks HDP 2.3 Distribution 
>Reporter: Akshat Thakar
>Priority: Minor
>
> I am trying to map an existing Parquet file to an external table using a 
> PySpark script. I am using HiveContext to execute Hive SQL.
> I was able to execute the same SQL command using the Hive shell.
> I get the error below while doing so: 
> >>> hive.sql('create external table temp_inserts like new_inserts stored as 
> >>> parquet LOCATION "/user/hive/warehouse/temp_inserts')
> 16/03/01 06:24:01 INFO ParseDriver: Parsing command: create external table 
> temp_inserts like new_inserts stored as parquet LOCATION 
> "/user/hive/warehouse/temp_inserts
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, 
> in sql
> return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
>   File 
> "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>  line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
> : org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 
> 'new_inserts'; line 1 pos 52
> at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
> at 
> org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
> at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
> at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
> at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
> at 
> org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
> at 
> scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
> at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
> at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala

[jira] [Updated] (SPARK-13588) Unable to Map Parquet file to Hive Table using HiveContext

2016-02-29 Thread Akshat Thakar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshat Thakar updated SPARK-13588:
--
Description: 
I am trying to map an existing Parquet file to an external table using a 
PySpark script. I am using HiveContext to execute Hive SQL.
I was able to execute the same SQL command using the Hive shell.

I get the error below while doing so: 

>>> hive.sql('create external table temp_inserts like new_inserts stored as 
>>> parquet LOCATION "/user/hive/warehouse/temp_inserts')
16/03/01 06:24:01 INFO ParseDriver: Parsing command: create external table 
temp_inserts like new_inserts stored as parquet LOCATION 
"/user/hive/warehouse/temp_inserts
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, 
in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File 
"/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
: org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 
'new_inserts'; line 1 pos 52
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
a

[jira] [Updated] (SPARK-13588) Unable to Map Parquet file to Hive Table using HiveContext

2016-02-29 Thread Akshat Thakar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akshat Thakar updated SPARK-13588:
--
Description: 
I am trying to map an existing Parquet file to an external table using a 
PySpark script. I am using HiveContext to execute Hive SQL.
I was able to execute the same SQL command using the Hive shell.

I get the error below while doing so: 

>>> hive.sql('create external table temp_inserts like new_inserts  stored as 
>>> parquet LOCATION "/user/hive/warehouse/temp_inserts')
16/03/01 06:16:35 INFO ParseDriver: Parsing command: create external table 
cdc_new like cdc_new  stored as parquet LOCATION 
"/user/hive/warehouse/cdc_temp_inserts
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, 
in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File 
"/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
: org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 
'temp_inserts'; line 1 pos 44
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 

[jira] [Commented] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType

2016-02-29 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173249#comment-15173249
 ] 

Cheng Lian commented on SPARK-13589:


[~nongli] Is SPARK-13533 related to this one?

> Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType
> ---
>
> Key: SPARK-13589
> URL: https://issues.apache.org/jira/browse/SPARK-13589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>  Labels: flaky-test
>
> Here are a few sample build failures caused by this test case:
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> # 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
> (I've pinned these builds on Jenkins so that they won't be cleaned up.)






[jira] [Created] (SPARK-13589) Flaky test: ParquetHadoopFsRelationSuite.test all data types - ByteType

2016-02-29 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-13589:
--

 Summary: Flaky test: ParquetHadoopFsRelationSuite.test all data 
types - ByteType
 Key: SPARK-13589
 URL: https://issues.apache.org/jira/browse/SPARK-13589
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.0.0
Reporter: Cheng Lian


Here are a few sample build failures caused by this test case:

# 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52164/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
# 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52154/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/
# 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52153/testReport/org.apache.spark.sql.sources/ParquetHadoopFsRelationSuite/test_all_data_types___ByteType/

(I've pinned these builds on Jenkins so that they won't be cleaned up.)






[jira] [Created] (SPARK-13588) Unable to Map Parquet file to Hive Table using HiveContext

2016-02-29 Thread Akshat Thakar (JIRA)
Akshat Thakar created SPARK-13588:
-

 Summary: Unable to Map Parquet file to Hive Table using HiveContext
 Key: SPARK-13588
 URL: https://issues.apache.org/jira/browse/SPARK-13588
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.1
 Environment: Linux Red Hat 6.7, Hortonworks HDP 2.3 Distribution 
Reporter: Akshat Thakar
Priority: Minor


I am trying to map an existing Parquet file to an external table using a 
PySpark script. I am using HiveContext to execute Hive SQL.
I was able to execute the same SQL command using the Hive shell.

I get the error below while doing so: 

>>> hive.sql('create external table temp_inserts like new_inserts  stored as 
>>> parquet LOCATION "/user/hive/warehouse/temp_inserts')
16/03/01 06:16:35 INFO ParseDriver: Parsing command: create external table 
cdc_new like cdc_new  stored as parquet LOCATION 
"/user/hive/warehouse/cdc_temp_inserts
Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/sql/context.py", line 488, 
in sql
return DataFrame(self._ssql_ctx.sql(sqlQuery), self)
  File 
"/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 538, in __call__
  File 
"/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o36.sql.
: org.apache.spark.sql.AnalysisException: missing EOF at 'stored' near 
'cdc_new'; line 1 pos 44
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:254)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
at 
org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:95)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:

[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/1/16 5:04 AM:


I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled (enable virtualenv)
* spark.pyspark.virtualenv.type (default/conda are supported; the default is 
native)
* spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
* spark.pyspark.virtualenv.path (path to the executable for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
I have implemented POC for this features. Here's oen simple command for how to 
execute use virtualenv in pyspark

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 properties needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228
 ] 

Jeff Zhang commented on SPARK-13587:


I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled (enable virtualenv)
* spark.pyspark.virtualenv.type (default/conda are supported; the default is 
native)
* spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
* spark.pyspark.virtualenv.path (path to the executable for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/1/16 5:02 AM:


I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in PySpark:

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled (enable virtualenv)
* spark.pyspark.virtualenv.type (default/conda are supported; the default is 
native)
* spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
* spark.pyspark.virtualenv.path (path to the executable for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
I have implemented POC for this features. Here's oen simple command for how to 
execute use virtualenv in pyspark

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 property needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> suitable for complicated dependencies, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, and 
> not easy to switch between environments)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.






[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173226#comment-15173226
 ] 

Xiao Li commented on SPARK-12720:
-

[~yhuai] I made a few attempts. Unfortunately, its Expand is different from 
what GroupingSet is forming. We need a separate function for multi-distinct 
aggregation queries. :(

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.






[jira] [Created] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-13587:
--

 Summary: Support virtualenv in PySpark
 Key: SPARK-13587
 URL: https://issues.apache.org/jira/browse/SPARK-13587
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jeff Zhang


Currently, it's not easy for users to add third-party Python packages in PySpark.
* One way is to use --py-files (suitable for simple dependencies, but not 
suitable for complicated dependencies, especially with transitive dependencies)
* Another way is to install packages manually on each node (time-consuming, and 
not easy to switch between environments)

Python now has 2 different virtualenv implementations: one is the native 
virtualenv, the other is through conda. This JIRA is trying to bring these 2 
tools to the distributed environment.






[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0

2016-02-29 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173224#comment-15173224
 ] 

Devaraj K commented on SPARK-13117:
---

I think we can start the Jetty server with the default value "0.0.0.0", and 
have it take the configured value of SPARK_PUBLIC_DNS if it is configured. This 
would change only the Web UI and would not impact anything else. The changes 
would be something like below:

{code:xml}
protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse("0.0.0.0")
{code}

{code:xml}
try {
  serverInfo = Some(startJettyServer(publicHostName, port, sslOptions, 
handlers, conf, name))
  logInfo("Started %s at http://%s:%d".format(className, publicHostName, 
boundPort))
} catch {
{code}

[~srowen], any suggestions? Thanks


> WebUI should use the local ip not 0.0.0.0
> -
>
> Key: SPARK-13117
> URL: https://issues.apache.org/jira/browse/SPARK-13117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jeremiah Jordan
>Priority: Minor
>
> When SPARK_LOCAL_IP is set, everything seems to correctly bind to and use that 
> IP except the WebUI. The WebUI should use SPARK_LOCAL_IP instead of always using 
> 0.0.0.0.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0

2016-02-29 Thread Devaraj K (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173218#comment-15173218
 ] 

Devaraj K commented on SPARK-13117:
---

[~jaypanicker], the proposed PR does the same, but there is a problem when 
accessing the web UI via localhost or 127.0.0.1. Please have a look at this 
comment: https://github.com/apache/spark/pull/11133#issuecomment-188937933.

> WebUI should use the local ip not 0.0.0.0
> -
>
> Key: SPARK-13117
> URL: https://issues.apache.org/jira/browse/SPARK-13117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jeremiah Jordan
>Priority: Minor
>
> When SPARK_LOCAL_IP is set, everything seems to correctly bind to and use that 
> IP except the WebUI. The WebUI should use SPARK_LOCAL_IP instead of always using 
> 0.0.0.0.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13117) WebUI should use the local ip not 0.0.0.0

2016-02-29 Thread Jay Panicker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173214#comment-15173214
 ] 

Jay Panicker edited comment on SPARK-13117 at 3/1/16 4:40 AM:
--

On systems with multiple interfaces, the ability to select the bind IP, instead 
of binding to all interfaces, would be nice.

Actually, the code has everything needed to do this, except for one change.

Relevant lines from 
spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala:
{code}
...
...
  protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName)
...
..
try {
  serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name))
  logInfo("Started %s at http://%s:%d".format(className, publicHostName, 
boundPort))
} 
{code}
Note the "0.0.0.0", even though publicHostName is available as a configuration 
option.

Making the following  change and an export 
SPARK_PUBLIC_DNS= in spark-config.sh solved the problem:

  serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, 
name))



was (Author: jaypanicker):

On systems with multiple interfaces, ability to select the bind ip will be 
nice, instead of binding to all.

Actually, the code has everything needed to do it, except one change.

Relevant lines from 
spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala:
...
...
  protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName)
...
..
try {
  serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name))
  logInfo("Started %s at http://%s:%d".format(className, publicHostName, 
boundPort))
} 

Note the "0.0.0.0", even though publicHostName is available as a configuration 
option.

Making the following  change and an export 
SPARK_PUBLIC_DNS= in spark-config.sh solved the problem:

  serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, 
name))


> WebUI should use the local ip not 0.0.0.0
> -
>
> Key: SPARK-13117
> URL: https://issues.apache.org/jira/browse/SPARK-13117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jeremiah Jordan
>Priority: Minor
>
> When SPARK_LOCAL_IP is set, everything seems to correctly bind to and use that 
> IP except the WebUI. The WebUI should use SPARK_LOCAL_IP instead of always using 
> 0.0.0.0.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13117) WebUI should use the local ip not 0.0.0.0

2016-02-29 Thread Jay Panicker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173214#comment-15173214
 ] 

Jay Panicker edited comment on SPARK-13117 at 3/1/16 4:40 AM:
--

On systems with multiple interfaces, the ability to select the bind IP, instead 
of binding to all interfaces, would be nice.

Actually, the code has everything needed to do this, except for one change.

Relevant lines from 
spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala:
{code}
...
...
  protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName)
...
..
try {
  serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name))
  logInfo("Started %s at http://%s:%d".format(className, publicHostName, 
boundPort))
} 
{code}
Note the "0.0.0.0", even though publicHostName is available as a configuration 
option.

Making the following  change and an export 
SPARK_PUBLIC_DNS= in spark-config.sh solved the problem:
{code}
  serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, 
name))
{code}


was (Author: jaypanicker):
On systems with multiple interfaces, ability to select the bind ip will be 
nice, instead of binding to all.

Actually, the code has everything needed to do it, except one change.

Relevant lines from 
spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala:
{code}
...
...
  protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName)
...
..
try {
  serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name))
  logInfo("Started %s at http://%s:%d".format(className, publicHostName, 
boundPort))
} 
{code}
Note the "0.0.0.0", even though publicHostName is available as a configuration 
option.

Making the following  change and an export 
SPARK_PUBLIC_DNS= in spark-config.sh solved the problem:

  serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, 
name))


> WebUI should use the local ip not 0.0.0.0
> -
>
> Key: SPARK-13117
> URL: https://issues.apache.org/jira/browse/SPARK-13117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jeremiah Jordan
>Priority: Minor
>
> When SPARK_LOCAL_IP is set, everything seems to correctly bind to and use that 
> IP except the WebUI. The WebUI should use SPARK_LOCAL_IP instead of always using 
> 0.0.0.0.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13117) WebUI should use the local ip not 0.0.0.0

2016-02-29 Thread Jay Panicker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173214#comment-15173214
 ] 

Jay Panicker commented on SPARK-13117:
--


On systems with multiple interfaces, the ability to select the bind IP, instead 
of binding to all interfaces, would be nice.

Actually, the code has everything needed to do this, except for one change.

Relevant lines from 
spark-1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala:
...
...
  protected val publicHostName = 
Option(conf.getenv("SPARK_PUBLIC_DNS")).getOrElse(localHostName)
...
..
try {
  serverInfo = Some(startJettyServer("0.0.0.0", port, handlers, conf, name))
  logInfo("Started %s at http://%s:%d".format(className, publicHostName, 
boundPort))
} 

Note the "0.0.0.0", even though publicHostName is available as a configuration 
option.

Making the following  change and an export 
SPARK_PUBLIC_DNS= in spark-config.sh solved the problem:

  serverInfo = Some(startJettyServer(publicHostName, port, handlers, conf, 
name))


> WebUI should use the local ip not 0.0.0.0
> -
>
> Key: SPARK-13117
> URL: https://issues.apache.org/jira/browse/SPARK-13117
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Jeremiah Jordan
>Priority: Minor
>
> When SPARK_LOCAL_IP is set, everything seems to correctly bind to and use that 
> IP except the WebUI. The WebUI should use SPARK_LOCAL_IP instead of always using 
> 0.0.0.0.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L137



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-29 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163
 ] 

Gayathri Murali edited comment on SPARK-6160 at 3/1/16 3:57 AM:


[~josephkb] How should the test stat results be persisted?

Option 1: Persist the test stats as a data member of the ChiSqSelectorModel 
class, in which case they need to be initialized through the constructor. This 
would mean modifying the code wherever the class is instantiated.

Option 2: Create an auxiliary class that just stores the test stat results.
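
For Option 1, a rough Scala sketch (names are illustrative only, not the current 
MLlib API):
{code}
// Rough sketch of Option 1 only; names here are hypothetical.
import org.apache.spark.mllib.stat.test.ChiSqTestResult

class ChiSqSelectorModelWithStats(
    val selectedFeatures: Array[Int],                 // features kept by the selector
    val featureTestResults: Array[ChiSqTestResult])   // chi-squared stats kept with the model
{code}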


was (Author: gayathrimurali):
[~josephkb] Should the test statistics result be stored as a text/parquet file? 
or Can it just be stored in a local array? 

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-29 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163
 ] 

Gayathri Murali edited comment on SPARK-6160 at 3/1/16 3:53 AM:


[~josephkb] Should the test statistics result be stored as a text/parquet file? 
or Can it just be stored in a local array? 


was (Author: gayathrimurali):
[~josephkb] Should the test statistics result be stored as a text/parquet file? 
or Can it just be stored in a local array? 

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-13586:

Priority: Minor  (was: Major)

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>Priority: Minor
>
> If we restart a streaming job that uses checkpointing and has been stopped for 
> hours, it will generate a lot of batches in the queue, and it takes a while to 
> handle these batches. So I propose adding a config to control whether to 
> generate the down-time batches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13586:


Assignee: Apache Spark

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>Assignee: Apache Spark
>
> If we restart a streaming job that uses checkpointing and has been stopped for 
> hours, it will generate a lot of batches in the queue, and it takes a while to 
> handle these batches. So I propose adding a config to control whether to 
> generate the down-time batches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13586:


Assignee: (was: Apache Spark)

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>
> If we restart a streaming job that uses checkpointing and has been stopped for 
> hours, it will generate a lot of batches in the queue, and it takes a while to 
> handle these batches. So I propose adding a config to control whether to 
> generate the down-time batches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173177#comment-15173177
 ] 

Apache Spark commented on SPARK-13586:
--

User 'jeanlyn' has created a pull request for this issue:
https://github.com/apache/spark/pull/11440

> add config to skip generate down time batch when restart StreamingContext
> -
>
> Key: SPARK-13586
> URL: https://issues.apache.org/jira/browse/SPARK-13586
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: jeanlyn
>
> If we restart a streaming job that uses checkpointing and has been stopped for 
> hours, it will generate a lot of batches in the queue, and it takes a while to 
> handle these batches. So I propose adding a config to control whether to 
> generate the down-time batches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13586) add config to skip generate down time batch when restart StreamingContext

2016-02-29 Thread jeanlyn (JIRA)
jeanlyn created SPARK-13586:
---

 Summary: add config to skip generate down time batch when restart 
StreamingContext
 Key: SPARK-13586
 URL: https://issues.apache.org/jira/browse/SPARK-13586
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.0
Reporter: jeanlyn


If we restart a streaming job that uses checkpointing and has been stopped for 
hours, it will generate a lot of batches in the queue, and it takes a while to 
handle these batches. So I propose adding a config to control whether to 
generate the down-time batches.
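
For context, a minimal sketch (illustrative paths and batch interval) of the 
checkpoint-based restart this proposal targets:
{code}
// Illustrative sketch only: on restart, getOrCreate rebuilds the StreamingContext
// from the checkpoint and schedules a batch for every interval that elapsed while
// the job was down; the proposed config would let users skip those down-time batches.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // illustrative path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("restart-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define DStream operations here ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}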



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6160) ChiSqSelector should keep test statistic info

2016-02-29 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173163#comment-15173163
 ] 

Gayathri Murali commented on SPARK-6160:


[~josephkb] Should the test statistics results be stored as a text/parquet file, 
or can they just be stored in a local array? 

> ChiSqSelector should keep test statistic info
> -
>
> Key: SPARK-6160
> URL: https://issues.apache.org/jira/browse/SPARK-6160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It is useful to have the test statistics explaining selected features, but 
> these data are thrown out when constructing the ChiSqSelectorModel.  The data 
> are expensive to recompute, so the ChiSqSelectorModel should store and expose 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-02-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173140#comment-15173140
 ] 

Xiao Li commented on SPARK-12719:
-

Actually, I involved my teammate to work on this. The PR is close to finish. He 
will submit it this week. : ) Thanks!

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12721) SQL generation support for script transformation

2016-02-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173138#comment-15173138
 ] 

Xiao Li commented on SPARK-12721:
-

Sorry, this is delayed until SPARK-13535 is resolved. Thanks!

> SQL generation support for script transformation
> 
>
> Key: SPARK-12721
> URL: https://issues.apache.org/jira/browse/SPARK-12721
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173133#comment-15173133
 ] 

Xiao Li commented on SPARK-12720:
-

Yeah, that is what I am hoping. Let me use this PR to do a quick proof of concept. 
I will submit a PR for you to review, maybe tonight. : )

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13580) Driver makes no progress when Executor's akka thread exits due to OOM.

2016-02-29 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-13580:
---
Summary: Driver makes no progress when Executor's akka thread exits due to 
OOM.  (was: Driver makes no progress after failed to remove broadcast on 
Executor)

> Driver makes no progress when Executor's akka thread exits due to OOM.
> --
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From the driver's log: it failed to remove broadcast data due to an RPC timeout 
> exception from executor #11, and it also failed to get a thread dump from 
> executor #11 due to an akka.actor.ActorNotFound exception.
> After that, the driver waited for executor #11 to finish one remaining task for 
> that job; all the other tasks for that job had finished.
> However, according to executor #11's log, it never received that task (it 
> received 9 other tasks and finished them).
> Since then, there has been no progress in the streaming job.
> I have attached the driver's log and jstack, and the executor's jstack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-02-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173070#comment-15173070
 ] 

Yin Huai commented on SPARK-12720:
--

[~smilegator] Will the approach of handling expand in the PR be also applicable 
to handle multi-distinct aggregation queries?

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xiao Li
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13581) LibSVM throws MatchError

2016-02-29 Thread Jakob Odersky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakob Odersky updated SPARK-13581:
--
Description: 
When running an action on a DataFrame obtained by reading from a libsvm file, a 
MatchError is thrown; however, doing the same on a cached DataFrame works fine.
{code}
val df = 
sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
//file is in spark repository

df.select(df("features")).show() //MatchError

df.cache()
df.select(df("features")).show() //OK
{code}

The exception stack trace is the following:
{code}
scala.MatchError: 1.0 (of class java.lang.Double)
[info]  at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
[info]  at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
[info]  at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
[info]  at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
{code}

This issue first appeared in commit {{1dac964c1}}, in PR 
[#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.

[~jeffzhang], do you have any insight of what could be going on?

cc [~iyounus]

  was:
When running an action on a DataFrame obtained by reading from a libsvm file a 
MatchError is thrown, however doing the same on a cached DataFrame works fine.
{code}
val df = 
sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
//file is

df.select(df("features")).show() //MatchError

df.cache()
df.select(df("features")).show() //OK
{code}

The exception stack trace is the following:
{code}
scala.MatchError: 1.0 (of class java.lang.Double)
[info]  at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
[info]  at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
[info]  at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
[info]  at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
{code}

This issue first appeared in commit {{1dac964c1}}, in PR 
[#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.

[~jeffzhang], do you have any insight of what could be going on?

cc [~iyounus]


> LibSVM throws MatchError
> 
>
> Key: SPARK-13581
> URL: https://issues.apache.org/jira/browse/SPARK-13581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jakob Odersky
>Assignee: Jeff Zhang
>Priority: Minor
>
> When running an action on a DataFrame obtained by reading from a libsvm file, 
> a MatchError is thrown; however, doing the same on a cached DataFrame works 
> fine.
> {code}
> val df = 
> sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
> //file is in spark repository
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$

[jira] [Commented] (SPARK-13581) LibSVM throws MatchError

2016-02-29 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173058#comment-15173058
 ] 

Jakob Odersky commented on SPARK-13581:
---

It's in the Spark repo: "data/mllib/sample_libsvm_data.txt"

> LibSVM throws MatchError
> 
>
> Key: SPARK-13581
> URL: https://issues.apache.org/jira/browse/SPARK-13581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jakob Odersky
>Assignee: Jeff Zhang
>Priority: Minor
>
> When running an action on a DataFrame obtained by reading from a libsvm file, 
> a MatchError is thrown; however, doing the same on a cached DataFrame works 
> fine.
> {code}
> val df = 
> sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
> //file is
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
> {code}
> This issue first appeared in commit {{1dac964c1}}, in PR 
> [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.
> [~jeffzhang], do you have any insight of what could be going on?
> cc [~iyounus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13581) LibSVM throws MatchError

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173052#comment-15173052
 ] 

Jeff Zhang commented on SPARK-13581:


[~jodersky] Can you attach the data file? I guess it is small. 

> LibSVM throws MatchError
> 
>
> Key: SPARK-13581
> URL: https://issues.apache.org/jira/browse/SPARK-13581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jakob Odersky
>Assignee: Jeff Zhang
>Priority: Minor
>
> When running an action on a DataFrame obtained by reading from a libsvm file, 
> a MatchError is thrown; however, doing the same on a cached DataFrame works 
> fine.
> {code}
> val df = 
> sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
> //file is
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
> {code}
> This issue first appeared in commit {{1dac964c1}}, in PR 
> [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.
> [~jeffzhang], do you have any insight of what could be going on?
> cc [~iyounus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13583) Support `UnusedImports` Java checkstyle rule

2016-02-29 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173040#comment-15173040
 ] 

Dongjoon Hyun commented on SPARK-13583:
---

I changed the content of this issue since `lint-java` is not executed 
automatically by Jenkins as of today.
Thanks, [~zsxwing]

> Support `UnusedImports` Java checkstyle rule
> 
>
> Key: SPARK-13583
> URL: https://issues.apache.org/jira/browse/SPARK-13583
> Project: Spark
>  Issue Type: Task
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review 
> by saving much time.
> This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` 
> rule to `checkstyle.xml` and fixing all existing unused imports.
> {code:title=checkstyle.xml|borderStyle=solid}
> +<module name="UnusedImports"/>
> {code}
> Unfortunately, `dev/lint-java` is not tested by Jenkins. ( 
> https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 )
> This will also help Spark contributors to check by themselves before 
> submitting their PRs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13583) Support `UnusedImports` Java checkstyle rule

2016-02-29 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13583:
--
Description: 
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review 
by saving much time.

This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` 
rule to `checkstyle.xml` and fixing all existing unused imports.
{code:title=checkstyle.xml|borderStyle=solid}
+<module name="UnusedImports"/>
{code}

Unfortunately, `dev/lint-java` is not tested by Jenkins. ( 
https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 )

This will also help Spark contributors to check by themselves before submitting 
their PRs.

  was:
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review 
by saving much time.

This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` 
rule to `checkstyle.xml` and fixing all existing unused imports.
{code:title=checkstyle.xml|borderStyle=solid}
+<module name="UnusedImports"/>
{code}

This will also prevent the upcoming PR from having unused imports.

Summary: Support `UnusedImports` Java checkstyle rule  (was: Enforce 
`UnusedImports` Java checkstyle rule)

> Support `UnusedImports` Java checkstyle rule
> 
>
> Key: SPARK-13583
> URL: https://issues.apache.org/jira/browse/SPARK-13583
> Project: Spark
>  Issue Type: Task
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review 
> by saving much time.
> This issue aims to enforce `UnusedImports` rule by adding a `UnusedImports` 
> rule to `checkstyle.xml` and fixing all existing unused imports.
> {code:title=checkstyle.xml|borderStyle=solid}
> +<module name="UnusedImports"/>
> {code}
> Unfortunately, `dev/lint-java` is not tested by Jenkins. ( 
> https://github.com/apache/spark/blob/master/dev/run-tests.py#L546 )
> This will also help Spark contributors to check by themselves before 
> submitting their PRs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13585) addPyFile behavior change between 1.6 and before

2016-02-29 Thread Santhosh Gorantla Ramakrishna (JIRA)
Santhosh Gorantla Ramakrishna created SPARK-13585:
-

 Summary: addPyFile behavior change between 1.6 and before
 Key: SPARK-13585
 URL: https://issues.apache.org/jira/browse/SPARK-13585
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Santhosh Gorantla Ramakrishna
Priority: Minor


addPyFile in earlier versions would remove the .py file if it already existed. 
In 1.6, it throws an exception "__.py exists and does not match contents of 
__.py".

This might be because the underlying Scala code takes an overwrite parameter, 
which is defaulted to false when called from Python:
{code}
  private def copyFile(
      url: String,
      sourceFile: File,
      destFile: File,
      fileOverwrite: Boolean,
      removeSourceFile: Boolean = false): Unit = {
{code}

It would be good if addPyFile took a parameter to set the overwrite flag, 
defaulting to false.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172786#comment-15172786
 ] 

Shixiong Zhu edited comment on SPARK-13580 at 3/1/16 12:06 AM:
---

It happens in an Akka thread. You can add {{-XX:OnOutOfMemoryError='kill %p'}} 
to the executor Java options to force the executor to exit on OOM.

Or just upgrade to 1.6.0, which no longer uses Akka.
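
For example, a minimal sketch of wiring that option in through SparkConf 
(assuming the executor Java options are not set elsewhere):
{code}
// Illustrative only: make the executor JVM exit on OutOfMemoryError instead of
// limping along with a dead Akka thread.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("oom-kill-example")
  .set("spark.executor.extraJavaOptions", "-XX:OnOutOfMemoryError='kill %p'")
{code}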


was (Author: zsxwing):
It hapens in an Akka thread. You can add {{-XX:OnOutOfMemoryError='kill -9 
%p'}} to the executor java options to force the executor exit for OOM

Or just upgrade to 1.6.0 which doesn't use Akka any more.

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From the driver's log: it failed to remove broadcast data due to an RPC timeout 
> exception from executor #11, and it also failed to get a thread dump from 
> executor #11 due to an akka.actor.ActorNotFound exception.
> After that, the driver waited for executor #11 to finish one remaining task for 
> that job; all the other tasks for that job had finished.
> However, according to executor #11's log, it never received that task (it 
> received 9 other tasks and finished them).
> Since then, there has been no progress in the streaming job.
> I have attached the driver's log and jstack, and the executor's jstack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu closed SPARK-13580.

Resolution: Not A Bug

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From the driver's log: it failed to remove broadcast data due to an RPC timeout 
> exception from executor #11, and it also failed to get a thread dump from 
> executor #11 due to an akka.actor.ActorNotFound exception.
> After that, the driver waited for executor #11 to finish one remaining task for 
> that job; all the other tasks for that job had finished.
> However, according to executor #11's log, it never received that task (it 
> received 9 other tasks and finished them).
> Since then, there has been no progress in the streaming job.
> I have attached the driver's log and jstack, and the executor's jstack.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13584:


Assignee: (was: Apache Spark)

> ContinuousQueryManagerSuite floods the logs with garbage
> 
>
> Key: SPARK-13584
> URL: https://issues.apache.org/jira/browse/SPARK-13584
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>
> We should clean up the following outputs
> {code}
> [info] ContinuousQueryManagerSuite:
> 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 0.0 (TID 1)
> java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.SparkContext$$anonfun$ru

[jira] [Commented] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172910#comment-15172910
 ] 

Apache Spark commented on SPARK-13584:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/11439

> ContinuousQueryManagerSuite floods the logs with garbage
> 
>
> Key: SPARK-13584
> URL: https://issues.apache.org/jira/browse/SPARK-13584
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>
> We should clean up the following outputs
> {code}
> [info] ContinuousQueryManagerSuite:
> 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 0.0 (TID 1)
> java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.r

[jira] [Assigned] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13584:


Assignee: Apache Spark

> ContinuousQueryManagerSuite floods the logs with garbage
> 
>
> Key: SPARK-13584
> URL: https://issues.apache.org/jira/browse/SPARK-13584
> Project: Spark
>  Issue Type: Test
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> We should clean up the following outputs
> {code}
> [info] ContinuousQueryManagerSuite:
> 16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 0.0 (TID 1)
> java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> 16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at 
> org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
>   at 
> org.apache.spark

[jira] [Created] (SPARK-13584) ContinuousQueryManagerSuite floods the logs with garbage

2016-02-29 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-13584:


 Summary: ContinuousQueryManagerSuite floods the logs with garbage
 Key: SPARK-13584
 URL: https://issues.apache.org/jira/browse/SPARK-13584
 Project: Spark
  Issue Type: Test
Reporter: Shixiong Zhu


We should clean up the following outputs
{code}
[info] ContinuousQueryManagerSuite:
16:30:20.473 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in 
stage 0.0 (TID 1)
java.lang.ArithmeticException: / by zero
at 
org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
at 
org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
at 
org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
16:30:20.506 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
stage 0.0 (TID 1, localhost): java.lang.ArithmeticException: / by zero
at 
org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply$mcII$sp(ContinuousQueryManagerSuite.scala:303)
at 
org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
at 
org.apache.spark.sql.streaming.ContinuousQueryManagerSuite$$anonfun$6.apply(ContinuousQueryManagerSuite.scala:303)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
at 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:847)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1802)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org

[jira] [Assigned] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13583:


Assignee: (was: Apache Spark)

> Enforce `UnusedImports` Java checkstyle rule
> 
>
> Key: SPARK-13583
> URL: https://issues.apache.org/jira/browse/SPARK-13583
> Project: Spark
>  Issue Type: Task
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> After SPARK-6990, `dev/lint-java` keeps the Java code healthy and helps PR 
> review by saving a lot of time.
> This issue aims to enforce the `UnusedImports` rule by adding it to 
> `checkstyle.xml` and fixing all existing unused imports.
> {code:title=checkstyle.xml|borderStyle=solid}
> +
> {code}
> This will also prevent upcoming PRs from introducing unused imports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172882#comment-15172882
 ] 

Apache Spark commented on SPARK-13583:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11438

> Enforce `UnusedImports` Java checkstyle rule
> 
>
> Key: SPARK-13583
> URL: https://issues.apache.org/jira/browse/SPARK-13583
> Project: Spark
>  Issue Type: Task
>Reporter: Dongjoon Hyun
>Priority: Trivial
>
> After SPARK-6990, `dev/lint-java` keeps the Java code healthy and helps PR 
> review by saving a lot of time.
> This issue aims to enforce the `UnusedImports` rule by adding it to 
> `checkstyle.xml` and fixing all existing unused imports.
> {code:title=checkstyle.xml|borderStyle=solid}
> +
> {code}
> This will also prevent upcoming PRs from introducing unused imports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881
 ] 

Zhong Wang edited comment on SPARK-13337 at 2/29/16 11:40 PM:
--

It doesn't help in my case, because it doesn't support null-safe joins. It 
would be great if there were an interface like:

{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, 
    nullSafe: Boolean): DataFrame
{code}

The current join-using-columns interface works great if the tables being joined 
don't contain null values: it can automatically eliminate the null columns 
generated by outer joins. The general join methods in your example support 
null-safe joins perfectly, but they cannot automatically eliminate the null 
columns generated by outer joins.

Sorry that it is a little bit complicated here. Please let me know if you need 
a concrete example.


was (Author: zwang):
It doesn't help in my case, because it doesn't support null-safe joins. It 
would be great if there were an interface like:

{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, 
    nullSafe: Boolean): DataFrame
{code}

It works great if the tables being joined don't contain null values: it can 
automatically eliminate the null columns generated by outer joins. The general 
join methods in your example support null-safe joins perfectly, but they cannot 
automatically eliminate the null columns generated by outer joins.

Sorry that it is a little bit complicated here. Please let me know if you need 
a concrete example.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13583:


Assignee: Apache Spark

> Enforce `UnusedImports` Java checkstyle rule
> 
>
> Key: SPARK-13583
> URL: https://issues.apache.org/jira/browse/SPARK-13583
> Project: Spark
>  Issue Type: Task
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>
> After SPARK-6990, `dev/lint-java` keeps the Java code healthy and helps PR 
> review by saving a lot of time.
> This issue aims to enforce the `UnusedImports` rule by adding it to 
> `checkstyle.xml` and fixing all existing unused imports.
> {code:title=checkstyle.xml|borderStyle=solid}
> +
> {code}
> This will also prevent upcoming PRs from introducing unused imports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172881#comment-15172881
 ] 

Zhong Wang commented on SPARK-13337:


It doesn't help in my case, because it doesn't support null-safe joins. It 
would be great if there were an interface like:

{code}
def join(right: DataFrame, usingColumns: Seq[String], joinType: String, 
    nullSafe: Boolean): DataFrame
{code}

It works great if the tables being joined don't contain null values: it can 
automatically eliminate the null columns generated by outer joins. The general 
join methods in your example support null-safe joins perfectly, but they cannot 
automatically eliminate the null columns generated by outer joins.

Sorry that it is a little bit complicated here. Please let me know if you need 
a concrete example.
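
For illustration, a rough sketch (not an existing Spark API) of how such a 
null-safe join-using-columns could be emulated today with the general join 
method: build the condition with {{<=>}} and then drop the right-hand copies of 
the join columns. The helper name is hypothetical, and it assumes a 
{{drop(col: Column)}} overload is available.

{code}
// Sketch only: emulate join-using-columns with null-safe equality.
import org.apache.spark.sql.DataFrame

def nullSafeJoinUsing(left: DataFrame, right: DataFrame,
                      usingColumns: Seq[String], joinType: String): DataFrame = {
  // Null-safe equality on every join column.
  val condition = usingColumns.map(c => left(c) <=> right(c)).reduce(_ && _)
  // Drop the right-hand copies to mimic the join-using-columns output.
  usingColumns.foldLeft(left.join(right, condition, joinType)) {
    (df, c) => df.drop(right(c))
  }
}
{code}

Note that for outer joins this still loses the right-side key values when the 
left side is null, which is part of the complication described above.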

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13583) Enforce `UnusedImports` Java checkstyle rule

2016-02-29 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13583:
-

 Summary: Enforce `UnusedImports` Java checkstyle rule
 Key: SPARK-13583
 URL: https://issues.apache.org/jira/browse/SPARK-13583
 Project: Spark
  Issue Type: Task
Reporter: Dongjoon Hyun
Priority: Trivial


After SPARK-6990, `dev/lint-java` keeps the Java code healthy and helps PR 
review by saving a lot of time.

This issue aims to enforce the `UnusedImports` rule by adding it to 
`checkstyle.xml` and fixing all existing unused imports.
{code:title=checkstyle.xml|borderStyle=solid}
+
{code}

This will also prevent upcoming PRs from introducing unused imports.
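
For reference, a sketch of what the addition to `checkstyle.xml` could look 
like; the exact placement inside the existing TreeWalker configuration is an 
assumption:

{code:title=checkstyle.xml (sketch)|borderStyle=solid}
<module name="TreeWalker">
    ...
    <!-- Assumed addition: flag unused imports during dev/lint-java. -->
    <module name="UnusedImports"/>
</module>
{code}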



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13582) Improve performance of parquet reader with dictionary encoding

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13582:


Assignee: Apache Spark  (was: Davies Liu)

> Improve performance of parquet reader with dictionary encoding
> --
>
> Key: SPARK-13582
> URL: https://issues.apache.org/jira/browse/SPARK-13582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, we replace the ids with values from a dictionary before accessing a 
> column. We could defer that, so that when some rows are filtered out, we do 
> not look up the dictionary for those rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13582) Improve performance of parquet reader with dictionary encoding

2016-02-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13582:


Assignee: Davies Liu  (was: Apache Spark)

> Improve performance of parquet reader with dictionary encoding
> --
>
> Key: SPARK-13582
> URL: https://issues.apache.org/jira/browse/SPARK-13582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we replace the ids with values from a dictionary before accessing a 
> column. We could defer that, so that when some rows are filtered out, we do 
> not look up the dictionary for those rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13582) Improve performance of parquet reader with dictionary encoding

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172841#comment-15172841
 ] 

Apache Spark commented on SPARK-13582:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11437

> Improve performance of parquet reader with dictionary encoding
> --
>
> Key: SPARK-13582
> URL: https://issues.apache.org/jira/browse/SPARK-13582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we replace the ids with values from a dictionary before accessing a 
> column. We could defer that, so that when some rows are filtered out, we do 
> not look up the dictionary for those rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13582) Improve performance of parquet reader with dictionary encoding

2016-02-29 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13582:
--

 Summary: Improve performance of parquet reader with dictionary 
encoding
 Key: SPARK-13582
 URL: https://issues.apache.org/jira/browse/SPARK-13582
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Right now, we replace the ids with values from a dictionary before accessing a 
column. We could defer that, so that when some rows are filtered out, we do not 
look up the dictionary for those rows.
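
A rough sketch of the idea (illustrative only, not the actual parquet reader 
code): keep the dictionary ids as read from the page and resolve them through 
the dictionary only for rows that survive the filters. All names below are 
hypothetical.

{code}
// Illustrative sketch of deferred dictionary decoding.
// ids: dictionary codes for a column chunk; dict: the decoded dictionary values;
// survives(i): whether row i passed the filters evaluated on other columns.
def deferredDecode(ids: Array[Int], dict: Array[String],
                   survives: Int => Boolean): Seq[String] = {
  val out = Seq.newBuilder[String]
  var i = 0
  while (i < ids.length) {
    if (survives(i)) {
      out += dict(ids(i))  // dictionary lookup only for rows we actually return
    }
    i += 1
  }
  out.result()
}
{code}

The eager alternative would materialize {{dict(ids(i))}} for every row before 
filtering, which is the extra work this issue wants to avoid.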



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-13571) Track current database in SQL/HiveContext

2016-02-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-13571:
--
Comment: was deleted

(was: User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11433)

> Track current database in SQL/HiveContext
> -
>
> Key: SPARK-13571
> URL: https://issues.apache.org/jira/browse/SPARK-13571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We already have internal APIs for Hive to do this. We should do it for 
> SQLContext too so we can merge these code paths one day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13571) Track current database in SQL/HiveContext

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172825#comment-15172825
 ] 

Apache Spark commented on SPARK-13571:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/11433

> Track current database in SQL/HiveContext
> -
>
> Key: SPARK-13571
> URL: https://issues.apache.org/jira/browse/SPARK-13571
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We already have internal APIs for Hive to do this. We should do it for 
> SQLContext too so we can merge these code paths one day.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13581) LibSVM throws MatchError

2016-02-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13581:
--
Assignee: Jeff Zhang

I guess the input it's hoping to treat as a vector type is just a double from 
the SVM input. As to why, I don't know; it seems like a legit bug, though.

> LibSVM throws MatchError
> 
>
> Key: SPARK-13581
> URL: https://issues.apache.org/jira/browse/SPARK-13581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Jakob Odersky
>Assignee: Jeff Zhang
>Priority: Minor
>
> When running an action on a DataFrame obtained by reading from a libsvm file, 
> a MatchError is thrown; however, doing the same on a cached DataFrame works 
> fine.
> {code}
> val df = 
> sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
> //file is
> df.select(df("features")).show() //MatchError
> df.cache()
> df.select(df("features")).show() //OK
> {code}
> The exception stack trace is the following:
> {code}
> scala.MatchError: 1.0 (of class java.lang.Double)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
> [info]at 
> org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
> [info]at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
> [info]at 
> org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
> {code}
> This issue first appeared in commit {{1dac964c1}}, in PR 
> [#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.
> [~jeffzhang], do you have any insight into what could be going on?
> cc [~iyounus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172810#comment-15172810
 ] 

Liyin Tang commented on SPARK-13580:


Thanks [~zsxwing] for the investigation! That's very helpful!

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172796#comment-15172796
 ] 

Xiao Li commented on SPARK-13337:
-

Sorry, I do not get your point. Join-using-columns does not help in your case, 
right? It just removes the overlapping columns, but it does not filter the 
values in the results.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172786#comment-15172786
 ] 

Shixiong Zhu commented on SPARK-13580:
--

It happens in an Akka thread. You can add {{-XX:OnOutOfMemoryError='kill -9 
%p'}} to the executor Java options to force the executor to exit on OOM.

Or just upgrade to 1.6.0, which doesn't use Akka any more.
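
For example, a sketch of setting the suggested option (the same value can also 
be passed with {{--conf}} on spark-submit):

{code}
// Sketch: make executors die on OOM instead of lingering in a wedged state.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:OnOutOfMemoryError='kill -9 %p'")
{code}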

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13581) LibSVM throws MatchError

2016-02-29 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-13581:
-

 Summary: LibSVM throws MatchError
 Key: SPARK-13581
 URL: https://issues.apache.org/jira/browse/SPARK-13581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Jakob Odersky
Priority: Minor


When running an action on a DataFrame obtained by reading from a libsvm file, a 
MatchError is thrown; however, doing the same on a cached DataFrame works fine.
{code}
val df = 
sqlContext.read.format("libsvm").load("../data/mllib/sample_libsvm_data.txt") 
//file is

df.select(df("features")).show() //MatchError

df.cache()
df.select(df("features")).show() //OK
{code}

The exception stack trace is the following:
{code}
scala.MatchError: 1.0 (of class java.lang.Double)
[info]  at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:207)
[info]  at org.apache.spark.mllib.linalg.VectorUDT.serialize(Vectors.scala:192)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$UDTConverter.toCatalystImpl(CatalystTypeConverters.scala:142)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
[info]  at 
org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
[info]  at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:59)
[info]  at 
org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$2.apply(ExistingRDD.scala:56)
{code}

This issue first appeared in commit {{1dac964c1}}, in PR 
[#9595|https://github.com/apache/spark/pull/9595] fixing SPARK-11622.

[~jeffzhang], do you have any insight into what could be going on?

cc [~iyounus]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13430) Expose ml summary function in PySpark for classification and regression models

2016-02-29 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172775#comment-15172775
 ] 

Bryan Cutler commented on SPARK-13430:
--

I can work on adding this

> Expose ml summary function in PySpark for classification and regression models
> --
>
> Key: SPARK-13430
> URL: https://issues.apache.org/jira/browse/SPARK-13430
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Reporter: Shubhanshu Mishra
>  Labels: classification, java, ml, mllib, pyspark, regression, 
> scala, sparkr
>
> I think model summary interface which is available in Spark's scala, Java and 
> R interfaces should also be available in the python interface. 
> Similar to #SPARK-11494
> https://issues.apache.org/jira/browse/SPARK-11494



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12817) Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172773#comment-15172773
 ] 

Apache Spark commented on SPARK-12817:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11436

> Remove CacheManager and replace it with new BlockManager.getOrElseUpdate 
> method
> ---
>
> Key: SPARK-12817
> URL: https://issues.apache.org/jira/browse/SPARK-12817
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> CacheManager directly calls MemoryStore.unrollSafely() and has its own logic 
> for handling graceful fallback to disk when cached data does not fit in 
> memory. However, this logic also exists inside of the MemoryStore itself, so 
> this appears to be unnecessary duplication.
> Thanks to the addition of block-level read/write locks, we can refactor the 
> code to remove the CacheManager and replace it with an atomic getOrElseUpdate 
> BlockManager method.
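
As a rough illustration of the pattern being described (a generic sketch, not 
Spark's actual BlockManager API; the names and the single map-wide lock are 
simplifications):

{code}
// Generic get-or-compute cache guarded by a read/write lock. Spark's version
// uses per-block locks and storage levels; this sketch only shows the shape
// of an atomic getOrElseUpdate entry point.
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

class SimpleBlockCache[K, V] {
  private val store = mutable.Map.empty[K, V]
  private val lock = new ReentrantReadWriteLock()

  def getOrElseUpdate(key: K)(compute: => V): V = {
    lock.readLock().lock()
    val cached = try store.get(key) finally lock.readLock().unlock()
    cached match {
      case Some(v) => v
      case None =>
        lock.writeLock().lock()
        try {
          // Re-check under the write lock in case another thread cached it first.
          store.getOrElseUpdate(key, compute)
        } finally lock.writeLock().unlock()
    }
  }
}
{code}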



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12817) Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method

2016-02-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12817:
---
Description: 
CacheManager directly calls MemoryStore.unrollSafely() and has its own logic 
for handling graceful fallback to disk when cached data does not fit in memory. 
However, this logic also exists inside of the MemoryStore itself, so this 
appears to be unnecessary duplication.

Thanks to the addition of block-level read/write locks, we can refactor the 
code to remove the CacheManager and replace it with an atomic getOrElseUpdate 
BlockManager method.


  was:
CacheManager directly calls MemoryStore.unrollSafely() and has its own logic 
for handling graceful fallback to disk when cached data does not fit in memory. 
However, this logic also exists inside of the MemoryStore itself, so this 
appears to be unnecessary duplication.

We can remove this duplication and delete a significant amount of BlockManager 
code which existed only to support this CacheManager code.


> Remove CacheManager and replace it with new BlockManager.getOrElseUpdate 
> method
> ---
>
> Key: SPARK-12817
> URL: https://issues.apache.org/jira/browse/SPARK-12817
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> CacheManager directly calls MemoryStore.unrollSafely() and has its own logic 
> for handling graceful fallback to disk when cached data does not fit in 
> memory. However, this logic also exists inside of the MemoryStore itself, so 
> this appears to be unnecessary duplication.
> Thanks to the addition of block-level read/write locks, we can refactor the 
> code to remove the CacheManager and replace it with an atomic getOrElseUpdate 
> BlockManager method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12817) Remove CacheManager and replace it with new BlockManager.getOrElseUpdate method

2016-02-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12817:
---
Summary: Remove CacheManager and replace it with new 
BlockManager.getOrElseUpdate method  (was: Simplify CacheManager code and 
remove unused BlockManager methods)

> Remove CacheManager and replace it with new BlockManager.getOrElseUpdate 
> method
> ---
>
> Key: SPARK-12817
> URL: https://issues.apache.org/jira/browse/SPARK-12817
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> CacheManager directly calls MemoryStore.unrollSafely() and has its own logic 
> for handling graceful fallback to disk when cached data does not fit in 
> memory. However, this logic also exists inside of the MemoryStore itself, so 
> this appears to be unnecessary duplication.
> We can remove this duplication and delete a significant amount of 
> BlockManager code which existed only to support this CacheManager code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709
 ] 

Zhong Wang edited comment on SPARK-13337 at 2/29/16 10:05 PM:
--

For an outer join, it is difficult to eliminate the null columns from the 
result, because the null columns can come from both tables. The 
`join-using-column` interface can automatically eliminate those columns, which 
is very convenient. Sorry that I missed this point in my last reply.


was (Author: zwang):
For an outer join, it is difficult to eliminate the null columns from the 
result. The `join-using-column` interface can automatically eliminate those 
columns, which is very convenient. Sorry that I missed this point in my last 
reply.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-29 Thread Zhong Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172709#comment-15172709
 ] 

Zhong Wang commented on SPARK-13337:


For an outer join, it is difficult to eliminate the null columns from the 
result. The `join-using-column` interface can automatically eliminate those 
columns, which is very convenient. Sorry that I missed this point in my last 
reply.

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172689#comment-15172689
 ] 

Shixiong Zhu commented on SPARK-13580:
--

OOM on the executor side:

Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler 
in thread "sparkExecutor-akka.actor.default-dispatcher-22"



> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Jingwei Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172672#comment-15172672
 ] 

Jingwei Lu commented on SPARK-13580:


Attached the executor log for #11. [~zsxwing]

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Jingwei Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jingwei Lu updated SPARK-13580:
---
Attachment: stderrfiltered.txt.gz

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack, 
> stderrfiltered.txt.gz
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172658#comment-15172658
 ] 

Shixiong Zhu edited comment on SPARK-13580 at 2/29/16 9:38 PM:
---

Could you post the executor #11 log?


was (Author: zsxwing):
Could you post the executor log?

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172658#comment-15172658
 ] 

Shixiong Zhu commented on SPARK-13580:
--

Could you post the executor log?

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2016-02-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172654#comment-15172654
 ] 

Steve Loughran commented on SPARK-10063:


sorry! HADOOP-9565

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> the output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-13580:
---
Attachment: executor_jstack
driver_log.txt
driver_jstack.txt

> Driver makes no progress after failed to remove broadcast on Executor
> -
>
> Key: SPARK-13580
> URL: https://issues.apache.org/jira/browse/SPARK-13580
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liyin Tang
> Attachments: driver_jstack.txt, driver_log.txt, executor_jstack
>
>
> From Driver's log: it failed to remove broadcast data due to RPC timeout 
> exception from executor #11. And it also failed to get thread dump from 
> executor #11 due to akka.actor.ActorNotFound exception.
> After that, driver waited for executor #11 to finish one task for that job. 
> All the other tasks are finished for that job.
> However, from the executor#11's log, it didn't get that task (it got 9 other 
> tasks and finished them) 
> Since then, there is no progress in the streaming job. 
> I have attached the driver's log and jstack, executor's jstack. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13580) Driver makes no progress after failed to remove broadcast on Executor

2016-02-29 Thread Liyin Tang (JIRA)
Liyin Tang created SPARK-13580:
--

 Summary: Driver makes no progress after failed to remove broadcast 
on Executor
 Key: SPARK-13580
 URL: https://issues.apache.org/jira/browse/SPARK-13580
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.2
Reporter: Liyin Tang


From the driver's log: it failed to remove broadcast data due to an RPC timeout 
exception from executor #11. It also failed to get a thread dump from executor 
#11 due to an akka.actor.ActorNotFound exception.

After that, the driver waited for executor #11 to finish one task for that job. 
All the other tasks for that job had finished.

However, from executor #11's log, it never received that task (it received 9 
other tasks and finished them).

Since then, there has been no progress in the streaming job.

I have attached the driver's log and jstack, and the executor's jstack.








--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13255) Integrate vectorized parquet scan with whole stage codegen.

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172632#comment-15172632
 ] 

Apache Spark commented on SPARK-13255:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11435

> Integrate vectorized parquet scan with whole stage codegen.
> ---
>
> Key: SPARK-13255
> URL: https://issues.apache.org/jira/browse/SPARK-13255
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Nong Li
>
> The code generated by whole-stage codegen is intended to run over batches of 
> rows. This task is to integrate ColumnarBatches with whole-stage codegen.
> The resulting generated code should look something like:
> {code}
> Iterator<ColumnarBatch> input;
> void process() {
>   while (input.hasNext()) {
>     ColumnarBatch batch = input.next();
>     for (Row row : batch) {
>       // Current function
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13478) Fetching delegation tokens for Hive fails when using proxy users

2016-02-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-13478.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.0.0

> Fetching delegation tokens for Hive fails when using proxy users
> 
>
> Key: SPARK-13478
> URL: https://issues.apache.org/jira/browse/SPARK-13478
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.0.0
>
>
> If you use spark-submit's proxy user support, the code that fetches 
> delegation tokens for the Hive Metastore fails. It seems like the Hive 
> library tries to connect to the Metastore as the proxy user, and it doesn't 
> have a Kerberos TGT for that user, so it fails.
> I don't know whether the same issue exists in the HBase code, but I'll make a 
> similar change so that both behave similarly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13123) Add wholestage codegen for sort

2016-02-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13123:
-
Assignee: Sameer Agarwal  (was: Nong Li)

> Add wholestage codegen for sort
> ---
>
> Key: SPARK-13123
> URL: https://issues.apache.org/jira/browse/SPARK-13123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> It should just implement CodegenSupport. It's future work to have this 
> operator use codegen more effectively.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13579) Stop building assemblies for Spark

2016-02-29 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13579:
--

 Summary: Stop building assemblies for Spark
 Key: SPARK-13579
 URL: https://issues.apache.org/jira/browse/SPARK-13579
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin


See parent bug for more details. This change needs to wait for the other 
sub-tasks to be finished, so that the code knows what to do when there's only a 
bunch of jars to work with.

This should cover both maven and sbt builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13123) Add wholestage codegen for sort

2016-02-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13123.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11359
[https://github.com/apache/spark/pull/11359]

> Add wholestage codegen for sort
> ---
>
> Key: SPARK-13123
> URL: https://issues.apache.org/jira/browse/SPARK-13123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> It should just implement CodegenSupport. It's future work to have this 
> operator use codegen more effectively.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13578) Make launcher lib and user scripts handle jar directories instead of single assembly file

2016-02-29 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13578:
--

 Summary: Make launcher lib and user scripts handle jar directories 
instead of single assembly file
 Key: SPARK-13578
 URL: https://issues.apache.org/jira/browse/SPARK-13578
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin


See parent bug for details. This step is necessary before we can remove the 
assembly from the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13577) Allow YARN to handle multiple jars, archive when uploading Spark dependencies

2016-02-29 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13577:
--

 Summary: Allow YARN to handle multiple jars, archive when 
uploading Spark dependencies
 Key: SPARK-13577
 URL: https://issues.apache.org/jira/browse/SPARK-13577
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Marcelo Vanzin


See parent bug for more details.

Before we remove assemblies from Spark, we need the YARN backend to understand 
how to find and upload multiple jars containing the Spark code. As a feature 
request made during the spec review, we should also allow the Spark code to be 
provided as an archive that would be uploaded as a single file to the cluster, 
but exploded when downloaded to the containers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13575) Remove streaming backends' assemblies

2016-02-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-13575:
---
Component/s: (was: YARN)
 (was: Spark Core)
 Streaming

> Remove streaming backends' assemblies
> -
>
> Key: SPARK-13575
> URL: https://issues.apache.org/jira/browse/SPARK-13575
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Streaming
>Reporter: Marcelo Vanzin
>
> See parent bug for details. This task covers removing assemblies for 
> streaming backends.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13576) Make examples jar not be an assembly

2016-02-29 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13576:
--

 Summary: Make examples jar not be an assembly
 Key: SPARK-13576
 URL: https://issues.apache.org/jira/browse/SPARK-13576
 Project: Spark
  Issue Type: Sub-task
  Components: Examples
Reporter: Marcelo Vanzin


See parent bug for details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13575) Remove streaming backends' assemblies

2016-02-29 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-13575:
--

 Summary: Remove streaming backends' assemblies
 Key: SPARK-13575
 URL: https://issues.apache.org/jira/browse/SPARK-13575
 Project: Spark
  Issue Type: Sub-task
Reporter: Marcelo Vanzin


See parent bug for details. This task covers removing assemblies for streaming 
backends.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

2016-02-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12941:
-
Fix Version/s: 1.5.3
   1.4.2

> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR 
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
>Reporter: Jose Martinez Poblete
>Assignee: Thomas Sebastian
> Fix For: 1.4.2, 1.5.3, 2.0.0, 1.6.2
>
>
> When exporting data from Spark to Oracle, string datatypes are translated to 
> TEXT for Oracle, which leads to the following error:
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
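
Independently of the fix in Spark itself, one workaround from user code is to register a custom JDBC dialect that maps Spark SQL's StringType to Oracle's VARCHAR2 before writing. A minimal sketch; the 255-character length and the object name are arbitrary choices for this example:

{noformat}
// Workaround sketch: register a dialect so DataFrame.write.jdbc(...) emits
// VARCHAR2 instead of TEXT for string columns on Oracle URLs.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

object OracleStringDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:oracle")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))
    case BooleanType => Some(JdbcType("NUMBER(1)", Types.NUMERIC))
    case _           => None
  }
}

// Register once per JVM, before the write:
// JdbcDialects.registerDialect(OracleStringDialect)
{noformat}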



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7253) Add example of belief propagation with GraphX

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7253.
--
Resolution: Workaround

An example was provided using the GraphFrames API 
(https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala).
 I marked this issue as "Workaround".

> Add example of belief propagation with GraphX
> -
>
> Key: SPARK-7253
> URL: https://issues.apache.org/jira/browse/SPARK-7253
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Joseph K. Bradley
>
> It would nice to document (via an example) how to use GraphX to do belief 
> propagation.  It's probably too much right now to talk about a full-fledged 
> graphical model library (and that would belong in MLlib anyways), but a 
> simple example of a graphical model + BP would be nice to add to GraphX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3665) Java API for GraphX

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3665.
--
Resolution: Workaround

GraphFrames (https://github.com/graphframes/graphframes) wraps GraphX 
algorithms under the DataFrames API and its Scala interface is compatible with 
Java. The project is in active development, currently as a 3rd-party package. 
I'm marking this issue as "Workaround" because it is much easier to support 
Java under the DataFrames API.

> Java API for GraphX
> ---
>
> Key: SPARK-3665
> URL: https://issues.apache.org/jira/browse/SPARK-3665
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX, Java API
>Affects Versions: 1.0.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The Java API will wrap the Scala API in a similar manner as JavaRDD. 
> Components will include:
> # JavaGraph
> #- removes optional param from persist, subgraph, mapReduceTriplets, 
> Graph.fromEdgeTuples, Graph.fromEdges, Graph.apply
> #- removes implicit {{=:=}} param from mapVertices, outerJoinVertices
> #- merges multiple parameters lists
> #- incorporates GraphOps
> # JavaVertexRDD
> # JavaEdgeRDD



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3789) [GRAPHX] Python bindings for GraphX

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3789.
--
Resolution: Workaround

GraphFrames (https://github.com/graphframes/graphframes) wraps GraphX 
algorithms under the DataFrames API and provides a Python interface. The 
project is in active development, currently as a 3rd-party package. I'm 
marking this issue as "Workaround" because it is much easier to support Python 
under the DataFrames API.

> [GRAPHX] Python bindings for GraphX
> ---
>
> Key: SPARK-3789
> URL: https://issues.apache.org/jira/browse/SPARK-3789
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, PySpark
>Reporter: Ameet Talwalkar
>Assignee: Kushal Datta
> Attachments: PyGraphX_design_doc.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7256) Add Graph abstraction which uses DataFrame

2016-02-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7256.
--
Resolution: Later

GraphFrames (https://github.com/graphframes/graphframes) implemented this idea 
and is in active development. So I marked this issue as "Later".
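
For reference, a small sketch of what the DataFrame-based graph abstraction looks like with GraphFrames; it assumes the third-party graphframes package is on the classpath and a Spark 2.x-style SparkSession, and uses the column names ("id", "src", "dst") that the GraphFrame constructor expects:

{noformat}
// Sketch: a graph built from two DataFrames, vertices with an "id" column
// and edges with "src"/"dst" columns.
import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame

object GraphFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("graphframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val vertices = Seq(("a", "Alice"), ("b", "Bob"), ("c", "Carol"))
      .toDF("id", "name")
    val edges = Seq(("a", "b", "follows"), ("b", "c", "follows"))
      .toDF("src", "dst", "relationship")

    val g = GraphFrame(vertices, edges)
    g.inDegrees.show()   // per-vertex in-degree, itself a DataFrame

    spark.stop()
  }
}
{noformat}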

> Add Graph abstraction which uses DataFrame
> --
>
> Key: SPARK-7256
> URL: https://issues.apache.org/jira/browse/SPARK-7256
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, SQL
>Reporter: Joseph K. Bradley
>Priority: Critical
>  Labels: dataframe, graphx
>
> RDD is to DataFrame as Graph is to ??? (this JIRA).
> It would be very useful long-term to have a Graph type which uses 2 
> DataFrames instead of 2 RDDs.
> The immediate benefit I have in mind is taking advantage of Spark SQL 
> datasources and storage formats.
> This could also be an opportunity to make an API which is more Java- and 
> Python-friendly.
> CC: [~ankurdave] [~rxin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13574) Improve parquet dictionary decoding for strings

2016-02-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172518#comment-15172518
 ] 

Apache Spark commented on SPARK-13574:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/11434

> Improve parquet dictionary decoding for strings
> ---
>
> Key: SPARK-13574
> URL: https://issues.apache.org/jira/browse/SPARK-13574
> Project: Spark
>  Issue Type: Improvement
>Reporter: Nong Li
>Priority: Minor
>
> Currently, the parquet reader will copy the dictionary value for each data 
> value. This is bad for string columns as we explode the dictionary during 
> decode. We should instead have the data values point to the same backing 
> memory.
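
The idea can be illustrated with a self-contained toy (this is not Spark's internal reader API): decode each distinct dictionary entry once and let rows reference it by index, instead of materializing a fresh string per data value.

{noformat}
// Toy sketch of dictionary-backed string storage: get(row) returns a
// reference into the shared dictionary, so no per-row copy is made.
object DictionaryColumnSketch {
  final class DictionaryStringColumn(dictionary: Array[String], codes: Array[Int]) {
    def get(row: Int): String = dictionary(codes(row))
    def numRows: Int = codes.length
  }

  def main(args: Array[String]): Unit = {
    val dictionary = Array("red", "green", "blue")   // decoded once
    val codes = Array(0, 2, 2, 1, 0, 2)              // one small int per row
    val column = new DictionaryStringColumn(dictionary, codes)

    (0 until column.numRows).foreach(i => println(column.get(i)))
    // Rows 0 and 4 share the very same String instance rather than copies.
    assert(column.get(0) eq column.get(4))
  }
}
{noformat}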



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


