[jira] [Created] (SPARK-20464) Add a job group and an informative job description for streaming queries

2017-04-25 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-20464:


 Summary: Add a job group and an informative job description for 
streaming queries
 Key: SPARK-20464
 URL: https://issues.apache.org/jira/browse/SPARK-20464
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Kunal Khamar









[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983820#comment-15983820
 ] 

Michael Armbrust commented on SPARK-18057:
--

I guess I'd like to understand more about what problems people are running into 
with the current version.  Are there more pressing issues than hanging when 
topics are deleted?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983832#comment-15983832
 ] 

Helena Edelson commented on SPARK-18057:


It is the timeout. I think waiting is better; I will be watching that ticket in 
Kafka.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Summary: Add examples for functions collection for pyspark  (was: Document 
major aggregation functions for pyspark)

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`
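
For reference, a minimal PySpark sketch of the kind of doctest-style examples the ticket asks for; it assumes an existing SparkSession named {{spark}}, and the data and column names here are made up:

{code}
# A rough sketch of example usages for some of the functions listed above.
from pyspark.sql import functions as F

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# Common aggregate functions, including collect_set/collect_list.
df.groupBy("key").agg(F.min("value"), F.mean("value"), F.collect_set("value")).show()

# unix_timestamp / from_unixtime round-trip (default format "yyyy-MM-dd HH:mm:ss").
ts = spark.createDataFrame([("2017-04-25 12:00:00",)], ["t"])
ts.select(F.from_unixtime(F.unix_timestamp("t")).alias("roundtrip")).show()

# lit wraps a Python literal as a Column; trigonometric functions take radians.
df.select(F.lit(1).alias("one"), F.sin(F.lit(0.0)).alias("sin0")).show()
{code}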






[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Description: 
Document `sql.functions.py`:

1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
`count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Add example for `lit`

  was:
Document `sql.functions.py`:

1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`


> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
> `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add example for `lit`






[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20456:
-
Component/s: PySpark

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
> `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add example for `lit`






[jira] [Updated] (SPARK-18127) Add hooks and extension points to Spark

2017-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18127:

Fix Version/s: 2.2.0

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this could be some 
> datasource-specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods.
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to 
> the current SQL language. I should not have to implement a complete parser, 
> and should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.






[jira] [Resolved] (SPARK-18127) Add hooks and extension points to Spark

2017-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18127.
-
Resolution: Fixed

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this could be some 
> datasource-specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods.
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to 
> the current SQL language. I should not have to implement a complete parser, 
> and should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.






[jira] [Commented] (SPARK-18127) Add hooks and extension points to Spark

2017-04-25 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983872#comment-15983872
 ] 

Frederick Reiss commented on SPARK-18127:
-

Is there a design document or a public design and requirements discussion 
associated with this JIRA?

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this could be some 
> datasource-specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods.
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to 
> the current SQL language. I should not have to implement a complete parser, 
> and should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.






[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983888#comment-15983888
 ] 

Hyukjin Kwon commented on SPARK-20456:
--

I simply left the comment above because the current status does not seem to match 
the description and the title. Let's fix the title and the description here.

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`






[jira] [Updated] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-20464:
-
Summary: Add a job group and an informative description for streaming 
queries  (was: Add a job group and an informative job description for streaming 
queries)

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>







[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983796#comment-15983796
 ] 

Helena Edelson commented on SPARK-18057:


I have a branch off branch-2.2 with the 0.10.2.0 upgrade and changes done. All 
the delete-topic-related tests fail (mainly just in streaming Kafka SQL).

I can open a PR with those few tests commented out, but that doesn't sound right. Or 
should I wait to PR?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html






[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983892#comment-15983892
 ] 

Hyukjin Kwon commented on SPARK-20445:
--

I meant the current codebase, the latest build. Testing against 2.1.0 is probably 
enough.

From the guidelines - http://spark.apache.org/contributing.html

> For issues that can’t be reproduced against master as reported, resolve as 
> Cannot Reproduce

I could not reproduce this in the current master, so I resolved it as 
{{Cannot Reproduce}}. I assume there is a JIRA that fixed this issue.
You (or anyone) can identify that JIRA and then make a backport if applicable.
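
For context, a minimal 2.x sketch of the pattern the error message in the subject points at: letting StringIndexer produce the label column so the number of classes is carried in the column metadata. The data here is made up and an existing SparkSession named {{spark}} is assumed.

{code}
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import Vectors

# Tiny made-up dataset: a string label plus a features vector.
df = spark.createDataFrame(
    [("setosa", Vectors.dense(5.1, 3.5)),
     ("versicolor", Vectors.dense(7.0, 3.2)),
     ("setosa", Vectors.dense(4.9, 3.0))],
    ["species", "features"])

# StringIndexer records the number of classes in the output column's metadata,
# which is what DecisionTreeClassifier needs to see.
indexed = StringIndexer(inputCol="species", outputCol="label").fit(df).transform(df)

model = DecisionTreeClassifier(labelCol="label", featuresCol="features").fit(indexed)
{code}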

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = 

[jira] [Resolved] (SPARK-20457) Spark CSV is not able to Override Schema while reading data

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20457.
--
Resolution: Duplicate

Currently, the nullability seems to be ignored. I am pretty sure this 
duplicates SPARK-19950, so I am resolving it.

> Spark CSV is not able to Override Schema while reading data
> ---
>
> Key: SPARK-20457
> URL: https://issues.apache.org/jira/browse/SPARK-20457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Himanshu Gupta
>
> I have a CSV file, test.csv:
> {code:xml}
> col
> 1
> 2
> 3
> 4
> {code}
> When I read it using Spark, it infers the schema of the data correctly:
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "true").csv("test.csv")
> 
> df.printSchema
> root
> |-- col: integer (nullable = true)
> {code}
> But when I override the `schema` of the CSV file and set `inferSchema` to false, 
> then SparkSession picks up the custom schema only partially.
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "false").schema(StructType(List(StructField("custom", StringType, 
> false)))).csv("test.csv")
> df.printSchema
> root
> |-- custom: string (nullable = true)
> {code}
> I mean only the column name (`custom`) and DataType (`StringType`) are getting 
> picked up. The `nullable` part is being ignored, as it still comes back as 
> `nullable = true`, which is incorrect.
> I am not able to understand this behavior.
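
For anyone wanting to check this from Python, a rough PySpark equivalent of the Scala snippet above (same file and column names as in the report, assuming a SparkSession named {{spark}}):

{code}
from pyspark.sql.types import StructType, StructField, StringType

# nullable=False is requested below, but printSchema still reports
# "nullable = true" at the time of the report.
schema = StructType([StructField("custom", StringType(), False)])
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "false")
      .schema(schema)
      .csv("test.csv"))
df.printSchema()
{code}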






[jira] [Assigned] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20464:


Assignee: (was: Apache Spark)

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>







[jira] [Assigned] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20464:


Assignee: Apache Spark

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>Assignee: Apache Spark
>







[jira] [Resolved] (SPARK-20130) Flaky test: BlockManagerProactiveReplicationSuite

2017-04-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20130.

Resolution: Cannot Reproduce

Seems a lot more stable now, so closing this until it becomes flaky again.

> Flaky test: BlockManagerProactiveReplicationSuite
> -
>
> Key: SPARK-20130
> URL: https://issues.apache.org/jira/browse/SPARK-20130
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> See following page:
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.storage.BlockManagerProactiveReplicationSuite_name=proactive+block+replication+-+5+replicas+-+4+block+manager+deletions
> I also have seen it fail intermittently during local unit test runs.






[jira] [Assigned] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20421:


Assignee: (was: Apache Spark)

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.






[jira] [Assigned] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20421:


Assignee: Apache Spark

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.






[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983896#comment-15983896
 ] 

Hyukjin Kwon commented on SPARK-20336:
--

Thank you guys for confirming this.

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> +----+----+----+
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> +----+----+----+
> |col1|col2|col3|
> +----+----+----+
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> +----+----+----+
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.






[jira] [Updated] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-20439:

Fix Version/s: 2.1.1

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.1, 2.2.0, 2.3.0
>
>
> Calling spark.catalog.listTables() or getTable may fail with an error from the 
> table's serde library:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Also, if the database contains any table (e.g., an index) with a table type that is 
> not accessible by Spark SQL, the whole listTables API fails.
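
For reference, a minimal PySpark sketch of the catalog call in question (database name made up, SparkSession {{spark}} assumed); as described above, this listing could previously fail because of a single table's missing serde class:

{code}
# List tables in a database; the whole call used to fail if any one table's
# input format / serde class was missing from the classpath.
for t in spark.catalog.listTables("default"):
    print(t.name, t.tableType)
{code}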






[jira] [Created] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-20465:


 Summary: Throws a proper exception rather than 
ArrayIndexOutOfBoundsException when temp directories could not be got/created
 Key: SPARK-20465
 URL: https://issues.apache.org/jira/browse/SPARK-20465
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Trivial


If none of the temp directories could be created, it throws an 
{{ArrayIndexOutOfBoundsException}} as below:

{code}
./bin/spark-shell --conf 
spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
Ignoring this directory.
17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
Ignoring this directory.
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.repl.Main$.(Main.scala:37)
at org.apache.spark.repl.Main$.(Main.scala)
... 10 more
{code}

It seems we should throw a proper exception with a better message.






[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter

2017-04-25 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984161#comment-15984161
 ] 

Yan Facai (颜发才) commented on SPARK-20199:
-

The work is easy; however, a public method is added and some adjustments are 
needed in the inner implementation. Hence, I suggest delaying it until an expert 
agrees to shepherd the issue.

I have two questions:

1. Since both GBDT and RandomForest share the attribute, we can pull the 
`featureSubsetStrategy` parameter up to either TreeEnsembleParams or 
DecisionTreeParams. Which one is appropriate?

2. Is it right to add the new parameter `featureSubsetStrategy` to the Strategy 
class, or to add it to DecisionTree's train method?
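
For comparison, a small PySpark sketch of the existing parameter on the random-forest side; the ticket is about exposing the same knob for gradient-boosted trees:

{code}
from pyspark.ml.classification import RandomForestClassifier

# featureSubsetStrategy already exists for the random forest estimators.
rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            featureSubsetStrategy="sqrt")
print(rf.explainParam("featureSubsetStrategy"))
{code}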

> GradientBoostedTreesModel doesn't have  Column Sampling Rate Paramenter
> ---
>
> Key: SPARK-20199
> URL: https://issues.apache.org/jira/browse/SPARK-20199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: pralabhkumar
>
> Spark GradientBoostedTreesModel doesn't have Column  sampling rate parameter 
> . This parameter is available in H2O and XGBoost. 
> Sample from H2O.ai 
> gbmParams._col_sample_rate
> Please provide the parameter . 






[jira] [Resolved] (SPARK-20437) R wrappers for rollup and cube

2017-04-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-20437.
--
  Resolution: Fixed
Assignee: Maciej Szymkiewicz
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> R wrappers for rollup and cube
> --
>
> Key: SPARK-20437
> URL: https://issues.apache.org/jira/browse/SPARK-20437
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add SparkR wrappers for {{Dataset.cube}} and {{Dataset.rollup}}.
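
For context, the underlying Dataset operations being wrapped, shown here via their PySpark counterparts (made-up data, SparkSession {{spark}} assumed); the ticket itself is about the SparkR wrappers:

{code}
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# cube/rollup produce grouping sets over the given columns.
df.cube("key").count().show()
df.rollup("key").count().show()
{code}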






[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984194#comment-15984194
 ] 

Liang-Chi Hsieh commented on SPARK-20392:
-

By also disabling the {{spark.sql.constraintPropagation.enabled}} flag, to 
eliminate the cost of constraint propagation on the big query plan, the running 
time can be further reduced to about 20 secs.
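
For anyone trying the workaround, a one-line PySpark sketch of setting the flag mentioned above on an existing session (the conf key is quoted from the comment):

{code}
# Disable constraint propagation before fitting the long pipeline.
spark.conf.set("spark.sql.constraintPropagation.enabled", "false")
{code}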

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 

[jira] [Updated] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20465:
-
Component/s: Spark Core

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> If none of the temp directories could be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.






[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984173#comment-15984173
 ] 

Liang-Chi Hsieh commented on SPARK-20392:
-

[~barrybecker4] Currently I think the performance degradation is caused by the 
cost of the exchange between the DataFrame/Dataset abstraction and logical plans. 
Some operations on DataFrames convert between DataFrames and logical plans. That 
cost can be ignored in typical SQL usage. However, it's not rare to chain dozens 
of pipeline stages in ML. When the query plan grows incrementally while running 
those stages, the cost spent on the exchange grows too. In particular, the 
Analyzer will go through the big query plan even though most of it is already 
analyzed.

By eliminating part of the cost, the time to run the example code locally is 
reduced from about 1 min to about 30 secs.


> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 

[jira] [Resolved] (SPARK-16548) java.io.CharConversionException: Invalid UTF-32 character prevents me from querying my data

2017-04-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16548.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.0

> java.io.CharConversionException: Invalid UTF-32 character  prevents me from 
> querying my data
> 
>
> Key: SPARK-16548
> URL: https://issues.apache.org/jira/browse/SPARK-16548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Egor Pahomov
>Priority: Minor
> Fix For: 2.2.0, 2.3.0
>
>
> Basically, when I query my json data I get 
> {code}
> java.io.CharConversionException: Invalid UTF-32 character 0x7b2265(above 
> 10)  at char #192, byte #771)
>   at 
> com.fasterxml.jackson.core.io.UTF32Reader.reportInvalid(UTF32Reader.java:189)
>   at com.fasterxml.jackson.core.io.UTF32Reader.read(UTF32Reader.java:150)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.loadMore(ReaderBasedJsonParser.java:153)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1855)
>   at 
> com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:571)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetJsonObject$$anonfun$eval$2$$anonfun$4.apply(jsonExpressions.scala:142)
> {code}
> I do not like it. If you cannot process one JSON record among 100500, please return 
> null; do not fail everything. I have a dirty one-line fix, and I understand how 
> I can make it more reasonable. What is our position - what behaviour do we want 
> to get?
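
For context, the expression named in the stack trace is the {{get_json_object}} SQL function; here is a tiny PySpark sketch of calling it (made-up data, SparkSession {{spark}} assumed). The report asks that a bad record yield null here rather than failing the whole query.

{code}
from pyspark.sql import functions as F

df = spark.createDataFrame([('{"a": 1}',), ("not json",)], ["js"])
# get_json_object extracts a path from a JSON string column.
df.select(F.get_json_object("js", "$.a").alias("a")).show()
{code}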






[jira] [Assigned] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20465:


Assignee: Apache Spark

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> If none of the temp directories could be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.






[jira] [Commented] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15984180#comment-15984180
 ] 

Apache Spark commented on SPARK-20465:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17768

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> If none of the temp directories could be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.(Main.scala:37)
>   at org.apache.spark.repl.Main$.(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.






[jira] [Assigned] (SPARK-20465) Throws a proper exception rather than ArrayIndexOutOfBoundsException when temp directories could not be got/created

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20465:


Assignee: (was: Apache Spark)

> Throws a proper exception rather than ArrayIndexOutOfBoundsException when 
> temp directories could not be got/created
> ---
>
> Key: SPARK-20465
> URL: https://issues.apache.org/jira/browse/SPARK-20465
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> If none of the temp directories can be created, it throws an 
> {{ArrayIndexOutOfBoundsException}} as below:
> {code}
> ./bin/spark-shell --conf 
> spark.local.dir=/NONEXISTENT_DIR_ONE,/NONEXISTENT_DIR_TWO
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_ONE. 
> Ignoring this directory.
> 17/04/26 13:11:06 ERROR Utils: Failed to create dir in /NONEXISTENT_DIR_TWO. 
> Ignoring this directory.
> Exception in thread "main" java.lang.ExceptionInInitializerError
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:756)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:179)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:204)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
>   at org.apache.spark.util.Utils$.getLocalDir(Utils.scala:743)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at org.apache.spark.repl.Main$$anonfun$1.apply(Main.scala:37)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.repl.Main$.<init>(Main.scala:37)
>   at org.apache.spark.repl.Main$.<clinit>(Main.scala)
>   ... 10 more
> {code}
> It seems we should throw a proper exception with a better message.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20239) Improve HistoryServer ACL mechanism

2017-04-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-20239:
---
Fix Version/s: 2.1.2
   2.0.3

> Improve HistoryServer ACL mechanism
> ---
>
> Key: SPARK-20239
> URL: https://issues.apache.org/jira/browse/SPARK-20239
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.3, 2.1.2, 2.2.0
>
>
> The current SHS (Spark History Server) has two different ACLs:
> * ACL of the base URL. This is controlled by "spark.acls.enabled" or 
> "spark.ui.acls.enabled". With this enabled, only users configured in 
> "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user 
> who started the SHS, can list the applications; otherwise none of them can be 
> listed. This also affects the REST APIs that list the summary of all apps and 
> of one app.
> * Per-application ACL. This is controlled by "spark.history.ui.acls.enabled". 
> With this enabled, only the history admin user and the user/group who ran an 
> app can access the details of that app. 
> With these two ACLs, we may encounter several unexpected behaviors:
> 1. If the base URL's ACL is enabled but user "A" has no view permission, user 
> "A" cannot see the app list but can still access the details of their own app.
> 2. If the base URL's ACL is disabled, then user "A" can see the summary of all 
> the apps, even ones not run by user "A", but cannot access their details.
> 3. The history admin ACL gives no permission to list all apps if the admin 
> user is not also added to the base URL's ACL.
> These unexpected behaviors arise mainly because we have two different ACLs; 
> ideally we should have only one to manage everything.
> So to improve the SHS's ACL mechanism, we should:
> * Unify the two ACLs into one, and always honor it (both for the base URL and 
> for app details).
> * Let a user list and display only the apps they ran, according to the ACLs 
> in the event log.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20464) Add a job group and an informative description for streaming queries

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983751#comment-15983751
 ] 

Apache Spark commented on SPARK-20464:
--

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/17765

> Add a job group and an informative description for streaming queries
> 
>
> Key: SPARK-20464
> URL: https://issues.apache.org/jira/browse/SPARK-20464
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kunal Khamar
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983809#comment-15983809
 ] 

Shixiong Zhu commented on SPARK-18057:
--

I prefer to just wait. The user can still use Kafka 0.10.2.0 with the current 
Spark Kafka source in their application; the APIs are compatible. Commenting 
tests out means we cannot prevent future changes from breaking them.
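
For example, a build.sbt sketch of that setup (version numbers are 
illustrative):

{code}
// Application build.sbt sketch: keep the Spark connector as-is but pin a newer
// kafka-clients, which the connector can use because the client APIs are compatible.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.1.0",
  "org.apache.kafka" %  "kafka-clients"        % "0.10.2.0"
)
{code}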

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20421) Mark JobProgressListener (and related classes) as deprecated

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983808#comment-15983808
 ] 

Apache Spark commented on SPARK-20421:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17766

> Mark JobProgressListener (and related classes) as deprecated
> 
>
> Key: SPARK-20421
> URL: https://issues.apache.org/jira/browse/SPARK-20421
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> This class (and others) were made {{@DeveloperApi}} as part of 
> https://github.com/apache/spark/pull/648. But as part of the work in 
> SPARK-18085, I plan to get rid of a lot of that code, so we should mark these 
> as deprecated in case anyone is still trying to use them.
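
A sketch of the kind of annotation this amounts to (the class below is a 
stand-in for illustration, not the actual change in the PR):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.SparkListener

// Hypothetical listener class used only to show the annotation; the real change
// would tag JobProgressListener and the related UI classes.
@deprecated("This will be removed as part of SPARK-18085.", "2.2.0")
class LegacyProgressListener(conf: SparkConf) extends SparkListener
{code}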



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983853#comment-15983853
 ] 

Ismael Juma commented on SPARK-18057:
-

It's worth noting that no-one is working on that ticket at the moment, so a fix 
may take some time. And even if it lands soon, it's likely to be in 0.11.0.0 
first (0.10.2.1 is being voted on and will be out very soon).

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-20461:


 Summary: CachedKafkaConsumer may hang forever when it's interrupted
 Key: SPARK-20461
 URL: https://issues.apache.org/jira/browse/SPARK-20461
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu


CachedKafkaConsumer may hang forever when it's interrupted because of KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20447) spark mesos scheduler suppress call

2017-04-25 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983314#comment-15983314
 ] 

Michael Gummelt commented on SPARK-20447:
-

The scheduler doesn't support suppression, no, but it does reject offers for 
120s: 
https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L375,
 and this is configurable.

With Mesos' 1s offer cycle, this should allow offers to be sent to 119 other 
frameworks before being re-offered to Spark.

Is this not sufficient? 
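
For example (the property name is the one used by the linked 
MesosSchedulerUtils code; the value here is only illustrative):

{code}
import org.apache.spark.SparkConf

// Lengthen the window during which declined offers are not re-sent to this
// framework, so an idle Spark app ties up offers even less often.
val conf = new SparkConf()
  .set("spark.mesos.rejectOfferDuration", "300s")
{code}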

> spark mesos scheduler suppress call
> ---
>
> Key: SPARK-20447
> URL: https://issues.apache.org/jira/browse/SPARK-20447
> Project: Spark
>  Issue Type: Wish
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Pavel Plotnikov
>Priority: Minor
>
> The Spark Mesos scheduler never sends the suppress call to Mesos to exclude 
> the application from Mesos' batch allocation process (HierarchicalDRF 
> allocator) when the Spark application is idle and there are no tasks in the 
> queue. As a result, the idle application keeps a 0% cluster share because of 
> dynamic resource allocation, while other applications that need additional 
> resources can't receive an offer because their cluster share is significantly 
> higher than 0%.
> About suppress call: 
> http://mesos.apache.org/documentation/latest/app-framework-development-guide/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20449) Upgrade breeze version to 0.13.1

2017-04-25 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-20449.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   3.0.0

Issue resolved by pull request 17746
[https://github.com/apache/spark/pull/17746]

> Upgrade breeze version to 0.13.1
> 
>
> Key: SPARK-20449
> URL: https://issues.apache.org/jira/browse/SPARK-20449
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 3.0.0, 2.2.0
>
>
> Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983387#comment-15983387
 ] 

Apache Spark commented on SPARK-20461:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17761

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2017-04-25 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-5484.
-
  Resolution: Fixed
Assignee: dingding  (was: Ankur Dave)
   Fix Version/s: 2.3.0
  2.2.0
Target Version/s: 2.2.0, 2.3.0

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: dingding
> Fix For: 2.2.0, 2.3.0
>
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.
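
A sketch of the idea (illustrative only, not the merged implementation; 
{{step}} stands for one user-supplied superstep):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.graphx.Graph

// Run an iterative graph computation, truncating the lineage every `interval`
// supersteps by checkpointing the working graph.
def iterateWithCheckpoints[VD, ED](
    sc: SparkContext,
    initial: Graph[VD, ED],
    maxIters: Int,
    interval: Int)(step: Graph[VD, ED] => Graph[VD, ED]): Graph[VD, ED] = {
  sc.setCheckpointDir("/tmp/graphx-checkpoints")  // placeholder path
  var g = initial
  for (i <- 1 to maxIters) {
    g = step(g)
    if (i % interval == 0) g.checkpoint()  // materialize the graph, drop the lineage
  }
  g
}
{code}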



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20461:


Assignee: Apache Spark

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20461) CachedKafkaConsumer may hang forever when it's interrupted

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20461:


Assignee: (was: Apache Spark)

> CachedKafkaConsumer may hang forever when it's interrupted
> --
>
> Key: SPARK-20461
> URL: https://issues.apache.org/jira/browse/SPARK-20461
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> CachedKafkaConsumer may hang forever when it's interrupted because of 
> KAFKA-1894



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983396#comment-15983396
 ] 

Shixiong Zhu commented on SPARK-18057:
--

[~guozhang] We have a stress test to test Spark Kafka connector for various 
cases, and it will try to frequently delete / re-create topics.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983396#comment-15983396
 ] 

Shixiong Zhu edited comment on SPARK-18057 at 4/25/17 6:29 PM:
---

[~guozhang] We have a stress test to test Spark Kafka connector for various 
cases, and it will try to frequently delete / re-create topics.

Topic deletion may not be common in production, but it will just hang forever 
and waste resources. It requires users to set up some monitor scripts to 
restart a job if it doesn't make any progress for a long time.


was (Author: zsxwing):
[~guozhang] We have a stress test to test Spark Kafka connector for various 
cases, and it will try to frequently delete / re-create topics.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20459) JdbcUtils throws IllegalStateException: Cause already initialized after getting SQLException

2017-04-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20459:

Target Version/s: 2.2.0

> JdbcUtils throws IllegalStateException: Cause already initialized after 
> getting SQLException
> 
>
> Key: SPARK-20459
> URL: https://issues.apache.org/jira/browse/SPARK-20459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.0.2, 2.1.0
>Reporter: Jessie Yu
>Priority: Minor
>
> While testing some failure scenarios, JdbcUtils throws an 
> IllegalStateException instead of the expected SQLException:
> {code}
> scala> 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils.saveTable(prodtbl,url3,"DB2.D_ITEM_INFO",prop1)
>  
> 17/04/03 17:19:35 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)  
>   
> java.lang.IllegalStateException: Cause already initialized
>   
> .at java.lang.Throwable.setCause(Throwable.java:365)  
>   
> .at java.lang.Throwable.initCause(Throwable.java:341) 
>   
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:241)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:300)
> .at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:299)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:902)
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>  
> .at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>   
> .at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   
> .at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   
> .at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) 
> .at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153
> .at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628
> .at java.lang.Thread.run(Thread.java:785) 
>   
> {code}
> The code in JdbcUtils.savePartition has 
> {code}
> } catch {
>   case e: SQLException =>
> val cause = e.getNextException
> if (cause != null && e.getCause != cause) {
>   if (e.getCause == null) {
> e.initCause(cause)
>   } else {
> e.addSuppressed(cause)
>   }
> }
> {code}
> According to Throwable Java doc, {{initCause()}} throws an 
> {{IllegalStateException}} "if this throwable was created with 
> Throwable(Throwable) or Throwable(String,Throwable), or this method has 
> already been called on this throwable". The code does check whether {{cause}} 
> is {{null}} before initializing it. However, {{getCause()}} "returns the 
> cause of this throwable or null if the cause is nonexistent or unknown." In 
> other words, {{getCause()}} can return {{null}} even when a cause has already 
> been set (but is unknown/null), in which case {{initCause()}} still throws an 
> {{IllegalStateException}}.
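
A minimal standalone illustration of that corner case (plain JDK behavior, not 
Spark code; the messages are made up):

{code}
object InitCauseDemo extends App {
  // A cause explicitly set to null through the constructor makes getCause()
  // return null, yet the cause is already "initialized", so initCause() throws.
  val e = new java.sql.SQLException("write failed", null: Throwable)
  println(e.getCause == null)                         // true: the null check passes
  try e.initCause(new RuntimeException("next exception"))
  catch { case ise: IllegalStateException => println(s"initCause refused: $ise") }
}
{code}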



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2017-04-25 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983440#comment-15983440
 ] 

Xiao Li commented on SPARK-20427:
-

cc [~tsuresh] Are you interested in this? 

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>
> In Oracle there exists the data type NUMBER. When defining a field of type 
> NUMBER in a table, the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it has created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale 
> (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame could be read from a table on one schema but 
> could not be inserted into the identical table on another schema.
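
While this is open, one possible workaround sketch is to cap precision/scale on 
the Oracle side so the JDBC source never sees a NUMBER wider than Spark's 
DecimalType limit of 38 (the URL, table, and column names below are 
placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-number-read").getOrCreate()

// Push a CAST into the dbtable subquery so the driver reports NUMBER(38, 10).
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "(SELECT CAST(amount AS NUMBER(38, 10)) AS amount FROM some_table) t")
  .option("user", "scott")
  .option("password", "tiger")
  .load()
{code}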



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20439) Catalog.listTables() depends on all libraries used to create tables

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983224#comment-15983224
 ] 

Apache Spark commented on SPARK-20439:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17760

> Catalog.listTables() depends on all libraries used to create tables
> ---
>
> Key: SPARK-20439
> URL: https://issues.apache.org/jira/browse/SPARK-20439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0, 2.3.0
>
>
> When calling spark.catalog.listTables() or getTable, you may get an error on 
> the table serde library:
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> com.amazon.emr.kinesis.hive.KinesisHiveInputFormat
> Or, if the database contains any table (e.g., an index) with a table type that 
> is not accessible by Spark SQL, it will fail the whole listTables API.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9103) Tracking spark's memory usage

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983531#comment-15983531
 ] 

Apache Spark commented on SPARK-9103:
-

User 'jsoltren' has created a pull request for this issue:
https://github.com/apache/spark/pull/17762

> Tracking spark's memory usage
> -
>
> Key: SPARK-9103
> URL: https://issues.apache.org/jira/browse/SPARK-9103
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Reporter: Zhang, Liye
> Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
>
> Currently Spark provides only a little memory usage information (RDD cache on 
> the web UI) for the executors. Users have no idea of the memory consumption 
> when they are running Spark applications that use a lot of memory in the 
> executors. Especially when they encounter an OOM, it's really hard to know 
> what caused the problem. So it would be helpful to give out detailed memory 
> consumption information for each part of Spark, so that users can have a 
> clear picture of where the memory is actually used. 
> The memory usage info to expose should include, but not be limited to, 
> shuffle, cache, network, serializer, etc.
> Users can optionally choose to enable this functionality since it is mainly 
> for debugging and tuning.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983590#comment-15983590
 ] 

Michael Patterson commented on SPARK-20456:
---

I saw that there are short docstrings for the aggregate functions, but I think 
they can be unclear for people new to Spark or to relational algebra. For 
example, some of my coworkers didn't know you could do `df.agg(mean(col))` 
without doing a `groupby`. There are also no links to `groupby` in any of the 
aggregate functions. I also didn't know about `collect_set` for a long time. I 
think adding examples would help with visibility and understanding.

The same thing applies to `lit`. It took me a while to learn what it did.

For the datetime stuff, for example this line has a column named 'd': 
https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926

I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20462) Spark-Kinesis Direct Connector

2017-04-25 Thread Lauren Moos (JIRA)
Lauren Moos created SPARK-20462:
---

 Summary: Spark-Kinesis Direct Connector 
 Key: SPARK-20462
 URL: https://issues.apache.org/jira/browse/SPARK-20462
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output
Affects Versions: 2.1.0
Reporter: Lauren Moos


I'd like to propose and vet the design for a direct connector between Spark 
and Kinesis.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-13747:
-
Fix Version/s: (was: 2.2.0)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-13747:
--

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983612#comment-15983612
 ] 

Apache Spark commented on SPARK-13747:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17763

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13747:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983615#comment-15983615
 ] 

Shixiong Zhu commented on SPARK-13747:
--

[~dnaumenko] Unfortunately, Spark uses ThreadLocal variables a lot but 
ForkJoinPool doesn't support that very well (It's easy to leak ThreadLocal 
variables to other tasks). Could you check if 
https://github.com/apache/spark/pull/17763 can fix your issue?
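
In the meantime, a workaround sketch (spark-shell style, relying on the shell's 
{{sc}} and implicits; illustrative only, not the PR's fix) is to drive the 
concurrent queries from a plain fixed thread pool instead of the global 
ForkJoinPool:

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// A dedicated fixed pool: a thread blocked in runJob is never reused for another
// task, so spark.sql.execution.id cannot leak between concurrent queries.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val counts = (1 to 100).map { _ =>
  Future(sc.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count())
}
counts.foreach(f => println(Await.result(f, 5.minutes)))
{code}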

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g.,scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-04-25 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983490#comment-15983490
 ] 

Kazuaki Ishizaki commented on SPARK-20392:
--

Here are my observations:

According to 
[Bucketizer.transform()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L104-L122]
* {{Projection}} is generated by calling {{dataset.withColumn()}} at line 121. 
Since {{Bucketizer.transform}} is called multiple times in the pipeline, 
deeply-nested projects are generated in the plan.
* {{bucketizer}} at line 119 is implemented as a {{UDF}}. Calling a {{UDF}} 
adds some serialization/deserialization overhead.
* If the number of fields is less than 101 (i.e. 
{{"spark.sql.codegen.maxFields"}}), whole-stage codegen is enabled. For 
fewCols it is enabled; however, in the original case, code generation is not 
enabled for the nested projects.

Is there a better approach to use {{Bucketizer}} effectively in this case? For 
example, can we reduce the number of calls to {{Bucketizer}}? cc:[~mlnick]
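
One knob to experiment with, per the analysis above (illustrative only; run in 
spark-shell, where {{spark}} is the active SparkSession; the default is 100):

{code}
// Allow whole-stage codegen for plans with more than 100 fields, so the wide
// (312-column) pipeline is compiled the same way as the fewCols variant.
spark.conf.set("spark.sql.codegen.maxFields", "500")
{code}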


> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here are a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 

[jira] [Created] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20463:
--

 Summary: Expose SPARK SQL <=> operator in PySpark
 Key: SPARK-20463
 URL: https://issues.apache.org/jira/browse/SPARK-20463
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Michael Styles


Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
*isNotDistinctFrom*. For example:

{panel}
{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df2 = sc.parallelize(data).toDF("c1", "c2")
df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
[Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}
{panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983654#comment-15983654
 ] 

Apache Spark commented on SPARK-20463:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/17764

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: Apache Spark

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: (was: Apache Spark)

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: (was: Apache Spark)

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20463:


Assignee: Apache Spark

> Expose SPARK SQL <=> operator in PySpark
> 
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>Assignee: Apache Spark
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF("c1", "c2")
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Armin Braun (JIRA)
Armin Braun created SPARK-20455:
---

 Summary: Missing Test Target in Documentation for "Running 
Docker-based Integration Test Suites"
 Key: SPARK-20455
 URL: https://issues.apache.org/jira/browse/SPARK-20455
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Armin Braun
Priority: Minor


The doc at 
http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
 is missing the `test` goal in the second line of the Maven build description.

It should be:

{code}
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl 
:spark-docker-integration-tests_2.11
{code}

Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982420#comment-15982420
 ] 

Apache Spark commented on SPARK-20455:
--

User 'original-brownbear' has created a pull request for this issue:
https://github.com/apache/spark/pull/17756

> Missing Test Target in Documentation for "Running Docker-based Integration 
> Test Suites"
> ---
>
> Key: SPARK-20455
> URL: https://issues.apache.org/jira/browse/SPARK-20455
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Armin Braun
>Priority: Minor
>
> The doc at 
> http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
>  is missing the `test` goal in the second line of the Maven build description.
> It should be:
> {code}
> ./build/mvn install -DskipTests
> ./build/mvn test -Pdocker-integration-tests -pl 
> :spark-docker-integration-tests_2.11
> {code}
> Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20455:


Assignee: (was: Apache Spark)

> Missing Test Target in Documentation for "Running Docker-based Integration 
> Test Suites"
> ---
>
> Key: SPARK-20455
> URL: https://issues.apache.org/jira/browse/SPARK-20455
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Armin Braun
>Priority: Minor
>
> The doc at 
> http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
>  is missing the `test` goal in the second line of the Maven build description.
> It should be:
> {code}
> ./build/mvn install -DskipTests
> ./build/mvn test -Pdocker-integration-tests -pl 
> :spark-docker-integration-tests_2.11
> {code}
> Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20455:


Assignee: Apache Spark

> Missing Test Target in Documentation for "Running Docker-based Integration 
> Test Suites"
> ---
>
> Key: SPARK-20455
> URL: https://issues.apache.org/jira/browse/SPARK-20455
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Armin Braun
>Assignee: Apache Spark
>Priority: Minor
>
> The doc at 
> http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
>  is missing the `test` goal in the second line of the Maven build description.
> It should be:
> {code}
> ./build/mvn install -DskipTests
> ./build/mvn test -Pdocker-integration-tests -pl 
> :spark-docker-integration-tests_2.11
> {code}
> Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Michael Patterson (JIRA)
Michael Patterson created SPARK-20456:
-

 Summary: Document major aggregation functions for pyspark
 Key: SPARK-20456
 URL: https://issues.apache.org/jira/browse/SPARK-20456
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Michael Patterson


Document `sql.functions.py`:

1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Strin

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-20445.
--
Resolution: Cannot Reproduce

I am resolving this as I can't reproduce it in the current master. Please reopen it 
if the problem still exists there.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model = self._fit_java(dataset) File 
> 
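
For reference, this error typically means the label column handed to DecisionTreeClassifier carries no ML attribute metadata, so the number of classes is unknown; in the script above the StringIndexer output appears to be discarded when the DataFrame is rebuilt from a plain RDD. A minimal sketch (written against a recent 2.x master, with made-up sample rows) that keeps the indexed column as the label:

{code}
# Hedged sketch, not the reporter's code: keep StringIndexer's output as the label so
# its metadata (number of classes) survives, and build features with VectorAssembler
# rather than rebuilding the DataFrame from a plain RDD (which drops the metadata).
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.master("local[2]").appName("dtc-label-metadata").getOrCreate()

iris = spark.createDataFrame(
    [(5.1, 3.5, 1.4, 0.2, "setosa"), (7.0, 3.2, 4.7, 1.4, "versicolor")],
    ["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH", "SPECIES"])

indexer = StringIndexer(inputCol="SPECIES", outputCol="label")
assembler = VectorAssembler(
    inputCols=["SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"],
    outputCol="features")
dtc = DecisionTreeClassifier(maxDepth=2, labelCol="label", featuresCol="features")

model = Pipeline(stages=[indexer, assembler, dtc]).fit(iris)
print(model.stages[-1].toDebugString)
{code}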

[jira] [Commented] (SPARK-20369) pyspark: Dynamic configuration with SparkConf does not work

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982489#comment-15982489
 ] 

Hyukjin Kwon commented on SPARK-20369:
--

It looks like I can't reproduce this; see below:

{code}
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("spark-conf-test") \
.setMaster("local[2]") \
.set('spark.python.worker.memory',"1g") \
.set('spark.executor.memory',"3g") \
.set("spark.driver.maxResultSize","2g")

print
print "Spark Config values in SparkConf:"
print conf.toDebugString()

sc = SparkContext(conf=conf)
print
print "Actual Spark Config values:"
print sc.getConf().toDebugString()

print conf.get("spark.python.worker.memory") == 
sc.getConf().get("spark.python.worker.memory")
print conf.get("spark.executor.memory") == 
sc.getConf().get("spark.executor.memory")
print conf.get("spark.driver.maxResultSize") == 
sc.getConf().get("spark.driver.maxResultSize")
{code}

{code}
Spark Config values in SparkConf:
spark.master=local[2]
spark.executor.memory=3g
spark.python.worker.memory=1g
spark.app.name=spark-conf-test
spark.driver.maxResultSize=2g
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/04/25 16:20:08 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable

Actual Spark Config values:
spark.app.id=local-1493104809510
spark.app.name=spark-conf-test
spark.driver.extraClassPath=/Users/hyukjinkwon/Desktop/workspace/local/forked/spark-xml/target/scala-2.11/spark-xml_2.11-0.4.0.jar
spark.driver.host=192.168.15.168
spark.driver.maxResultSize=2g
spark.driver.port=56783
spark.executor.extraClassPath=/Users/hyukjinkwon/Desktop/workspace/local/forked/spark-xml/target/scala-2.11/spark-xml_2.11-0.4.0.jar
spark.executor.id=driver
spark.executor.memory=3g
spark.master=local[2]
spark.python.worker.memory=1g
spark.rdd.compress=True
spark.serializer.objectStreamReset=100
spark.submit.deployMode=client
True
True
True
{code}

Could you check whether this still happens on the current master?

> pyspark: Dynamic configuration with SparkConf does not work
> ---
>
> Key: SPARK-20369
> URL: https://issues.apache.org/jira/browse/SPARK-20369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64) 
> and Mac OS X 10.11.6
>Reporter: Matthew McClain
>Priority: Minor
>
> Setting spark properties dynamically in pyspark using SparkConf object does 
> not work. Here is the code that shows the bug:
> ---
> from pyspark import SparkContext, SparkConf
> def main():
> conf = SparkConf().setAppName("spark-conf-test") \
> .setMaster("local[2]") \
> .set('spark.python.worker.memory',"1g") \
> .set('spark.executor.memory',"3g") \
> .set("spark.driver.maxResultSize","2g")
> print "Spark Config values in SparkConf:"
> print conf.toDebugString()
> sc = SparkContext(conf=conf)
> print "Actual Spark Config values:"
> print sc.getConf().toDebugString()
> if __name__  == "__main__":
> main()
> ---
> Here is the output; none of the config values set in SparkConf are used in 
> the SparkContext configuration:
> Spark Config values in SparkConf:
> spark.master=local[2]
> spark.executor.memory=3g
> spark.python.worker.memory=1g
> spark.app.name=spark-conf-test
> spark.driver.maxResultSize=2g
> 17/04/18 10:21:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Actual Spark Config values:
> spark.app.id=local-1492528885708
> spark.app.name=sandbox.py
> spark.driver.host=10.201.26.172
> spark.driver.maxResultSize=4g
> spark.driver.port=54657
> spark.executor.id=driver
> spark.files=file:/Users/matt.mcclain/dev/datascience-experiments/mmcclain/client_clusters/sandbox.py
> spark.master=local[*]
> spark.rdd.compress=True
> spark.serializer.objectStreamReset=100
> spark.submit.deployMode=client



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20369) pyspark: Dynamic configuration with SparkConf does not work

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982489#comment-15982489
 ] 

Hyukjin Kwon edited comment on SPARK-20369 at 4/25/17 7:24 AM:
---

It looks like I can't reproduce this; see below:

{code}
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("spark-conf-test") \
.setMaster("local[2]") \
.set('spark.python.worker.memory',"1g") \
.set('spark.executor.memory',"3g") \
.set("spark.driver.maxResultSize","2g")

print
print "Spark Config values in SparkConf:"
print conf.toDebugString()

sc = SparkContext(conf=conf)
print
print "Actual Spark Config values:"
print sc.getConf().toDebugString()

print conf.get("spark.python.worker.memory") == 
sc.getConf().get("spark.python.worker.memory")
print conf.get("spark.executor.memory") == 
sc.getConf().get("spark.executor.memory")
print conf.get("spark.driver.maxResultSize") == 
sc.getConf().get("spark.driver.maxResultSize")
{code}

{code}
Spark Config values in SparkConf:
spark.master=local[2]
spark.executor.memory=3g
spark.python.worker.memory=1g
spark.app.name=spark-conf-test
spark.driver.maxResultSize=2g
...

Actual Spark Config values:
...
spark.driver.maxResultSize=2g
spark.app.name=spark-conf-test
spark.executor.memory=3g
spark.master=local[2]
spark.python.worker.memory=1g
...
True
True
True
{code}

Could you check whether this still happens on the current master?


was (Author: hyukjin.kwon):
It looks like I can't reproduce this; see below:

{code}
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("spark-conf-test") \
.setMaster("local[2]") \
.set('spark.python.worker.memory',"1g") \
.set('spark.executor.memory',"3g") \
.set("spark.driver.maxResultSize","2g")

print
print "Spark Config values in SparkConf:"
print conf.toDebugString()

sc = SparkContext(conf=conf)
print
print "Actual Spark Config values:"
print sc.getConf().toDebugString()

print conf.get("spark.python.worker.memory") == 
sc.getConf().get("spark.python.worker.memory")
print conf.get("spark.executor.memory") == 
sc.getConf().get("spark.executor.memory")
print conf.get("spark.driver.maxResultSize") == 
sc.getConf().get("spark.driver.maxResultSize")
{code}

{code}
Spark Config values in SparkConf:
spark.master=local[2]
spark.executor.memory=3g
spark.python.worker.memory=1g
spark.app.name=spark-conf-test
spark.driver.maxResultSize=2g
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/04/25 16:20:08 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable

Actual Spark Config values:
spark.app.id=local-1493104809510
spark.app.name=spark-conf-test
spark.driver.extraClassPath=/Users/hyukjinkwon/Desktop/workspace/local/forked/spark-xml/target/scala-2.11/spark-xml_2.11-0.4.0.jar
spark.driver.host=192.168.15.168
spark.driver.maxResultSize=2g
spark.driver.port=56783
spark.executor.extraClassPath=/Users/hyukjinkwon/Desktop/workspace/local/forked/spark-xml/target/scala-2.11/spark-xml_2.11-0.4.0.jar
spark.executor.id=driver
spark.executor.memory=3g
spark.master=local[2]
spark.python.worker.memory=1g
spark.rdd.compress=True
spark.serializer.objectStreamReset=100
spark.submit.deployMode=client
True
True
True
{code}

Could you check whether this still happens on the current master?

> pyspark: Dynamic configuration with SparkConf does not work
> ---
>
> Key: SPARK-20369
> URL: https://issues.apache.org/jira/browse/SPARK-20369
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
> Environment: Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64) 
> and Mac OS X 10.11.6
>Reporter: Matthew McClain
>Priority: Minor
>
> Setting spark properties dynamically in pyspark using SparkConf object does 
> not work. Here is the code that shows the bug:
> ---
> from pyspark import SparkContext, SparkConf
> def main():
> conf = SparkConf().setAppName("spark-conf-test") \
> .setMaster("local[2]") \
> .set('spark.python.worker.memory',"1g") \
> .set('spark.executor.memory',"3g") \
> .set("spark.driver.maxResultSize","2g")
> print "Spark Config values in SparkConf:"
> print conf.toDebugString()
> sc = SparkContext(conf=conf)
> print "Actual Spark Config values:"
> print sc.getConf().toDebugString()
> if __name__  == "__main__":
> main()
> ---
> Here is the output; none of the config values set in SparkConf are used in 
> the SparkContext configuration:
> Spark Config values in SparkConf:
> spark.master=local[2]
> spark.executor.memory=3g
> spark.python.worker.memory=1g
> spark.app.name=spark-conf-test
> spark.driver.maxResultSize=2g
> 17/04/18 

[jira] [Updated] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20455:
--
Priority: Trivial  (was: Minor)

> Missing Test Target in Documentation for "Running Docker-based Integration 
> Test Suites"
> ---
>
> Key: SPARK-20455
> URL: https://issues.apache.org/jira/browse/SPARK-20455
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Armin Braun
>Priority: Trivial
>
> The doc at 
> http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
>  is missing the `test` goal in the second line of the Maven build description.
> It should be:
> {code}
> ./build/mvn install -DskipTests
> ./build/mvn test -Pdocker-integration-tests -pl 
> :spark-docker-integration-tests_2.11
> {code}
> Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982507#comment-15982507
 ] 

Hyukjin Kwon commented on SPARK-20456:
--


> Document `sql.functions.py`:
1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)

I think we already have documentation for ...

min - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.min
max - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.max
mean - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.mean
count - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.count
collect_set - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_set
collect_list - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
stddev - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
variance - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.variance

in 
https://github.com/apache/spark/blob/3fbf0a5f9297f438bc92db11f106d4a0ae568613/python/pyspark/sql/functions.py

> 2. Rename columns in datetime examples.

Could you give some pointers?

> 5. Document `lit`

lit - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.lit

It seems documented.

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982507#comment-15982507
 ] 

Hyukjin Kwon edited comment on SPARK-20456 at 4/25/17 7:37 AM:
---

{quote}
Document `sql.functions.py`:
1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)
{quote}

I think we already have documentation for ...

min - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.min
max - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.max
mean - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.mean
count - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.count
collect_set - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_set
collect_list - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
stddev - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
variance - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.variance

in 
https://github.com/apache/spark/blob/3fbf0a5f9297f438bc92db11f106d4a0ae568613/python/pyspark/sql/functions.py

{quote}
2. Rename columns in datetime examples.
{quote}

Could you give some pointers?

{quote}
5. Document `lit`
{quote}

lit - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.lit

It seems documented.


was (Author: hyukjin.kwon):

> Document `sql.functions.py`:
1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)

I think we already have documentation for ...

min - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.min
max - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.max
mean - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.mean
count - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.count
collect_set - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_set
collect_list - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
stddev - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
variance - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.variance

in 
https://github.com/apache/spark/blob/3fbf0a5f9297f438bc92db11f106d4a0ae568613/python/pyspark/sql/functions.py

> 2. Rename columns in datetime examples.

Could you give some pointers?

> 5. Document `lit`

lit - 
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.lit

It seems documented.

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-20456:
-
Priority: Minor  (was: Major)

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20455:
-

Assignee: Armin Braun

> Missing Test Target in Documentation for "Running Docker-based Integration 
> Test Suites"
> ---
>
> Key: SPARK-20455
> URL: https://issues.apache.org/jira/browse/SPARK-20455
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Armin Braun
>Assignee: Armin Braun
>Priority: Trivial
> Fix For: 2.1.2, 2.2.0
>
>
> The doc at 
> http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
>  is missing the `test` goal in the second line of the Maven build description.
> It should be:
> {code}
> ./build/mvn install -DskipTests
> ./build/mvn test -Pdocker-integration-tests -pl 
> :spark-docker-integration-tests_2.11
> {code}
> Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20455) Missing Test Target in Documentation for "Running Docker-based Integration Test Suites"

2017-04-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20455.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 17756
[https://github.com/apache/spark/pull/17756]

> Missing Test Target in Documentation for "Running Docker-based Integration 
> Test Suites"
> ---
>
> Key: SPARK-20455
> URL: https://issues.apache.org/jira/browse/SPARK-20455
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Armin Braun
>Priority: Trivial
> Fix For: 2.2.0, 2.1.2
>
>
> The doc at 
> http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites
>  is missing the `test` goal in the second line of the Maven build description.
> It should be:
> {code}
> ./build/mvn install -DskipTests
> ./build/mvn test -Pdocker-integration-tests -pl 
> :spark-docker-integration-tests_2.11
> {code}
> Adding a PR now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-04-25 Thread sskadarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982550#comment-15982550
 ] 

sskadarkar commented on SPARK-18492:


[~tdas] I am also getting the same error that Rupinder encountered for large windows.
 

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext) each of 
> which has totally repetitive sequences of statements pertaining to each of 
> the sequences of variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The reason that the 64KB JVM limit for code for a method is exceeded is 
> because the code generator is using an incredibly naive strategy.  It emits a 
> sequence like the one shown below for each of the 1,454 groups of variables 
> shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 12269 */   if (!project_isNull252) {
> /* 12270 */ project_value252 = project_result249;
> /* 12271 */   }
> /* 
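
For anyone hitting this while it remains open: in the dump above the oversized class is the whole-stage GeneratedIterator, so a commonly used stopgap is to disable whole-stage code generation for the session. This is a hedged sketch, not a fix; it costs performance and does not help when the 64 KB limit is hit inside other generated classes:

{code}
# Hedged workaround sketch, not a fix: turn off whole-stage code generation so very
# wide projections fall back to the non-fused execution path. The configuration key
# spark.sql.codegen.wholeStage is a standard Spark 2.x SQL setting.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("codegen-64kb-workaround")
         .config("spark.sql.codegen.wholeStage", "false")
         .getOrCreate())

# The same setting can also be toggled on an existing session before running the query:
spark.conf.set("spark.sql.codegen.wholeStage", "false")
{code}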

[jira] [Assigned] (SPARK-20404) Regression with accumulator names when migrating from 1.6 to 2.x

2017-04-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-20404:
-

  Assignee: Sergey Zhemzhitsky
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Regression with accumulator names when migrating from 1.6 to 2.x
> 
>
> Key: SPARK-20404
> URL: https://issues.apache.org/jira/browse/SPARK-20404
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: Spark: 2.1
> Scala: 2.11
> Spark master: local
>Reporter: Sergey Zhemzhitsky
>Assignee: Sergey Zhemzhitsky
>Priority: Minor
> Fix For: 2.1.2, 2.2.0
>
> Attachments: spark-context-accum-option.patch
>
>
> Creating accumulator with explicitly specified name equal to _null_, like the 
> following
> {code:java}
> sparkContext.accumulator(0, null)
> {code}
> throws exception at runtime
> {code:none}
> ERROR | DAGScheduler | dag-scheduler-event-loop | Failed to update 
> accumulators for task 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.AccumulatorV2$$anonfun$1.apply(AccumulatorV2.scala:106)
>   at 
> org.apache.spark.util.AccumulatorV2$$anonfun$1.apply(AccumulatorV2.scala:106)
>   at scala.Option.exists(Option.scala:240)
>   at org.apache.spark.util.AccumulatorV2.toInfo(AccumulatorV2.scala:106)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The issue is in wrapping name into _Some_ instead of _Option_ when creating 
> accumulators.
> Patch is available.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20404) Regression with accumulator names when migrating from 1.6 to 2.x

2017-04-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20404.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.2.0

Issue resolved by pull request 17740
[https://github.com/apache/spark/pull/17740]

> Regression with accumulator names when migrating from 1.6 to 2.x
> 
>
> Key: SPARK-20404
> URL: https://issues.apache.org/jira/browse/SPARK-20404
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Environment: Spark: 2.1
> Scala: 2.11
> Spark master: local
>Reporter: Sergey Zhemzhitsky
> Fix For: 2.2.0, 2.1.2
>
> Attachments: spark-context-accum-option.patch
>
>
> Creating accumulator with explicitly specified name equal to _null_, like the 
> following
> {code:java}
> sparkContext.accumulator(0, null)
> {code}
> throws exception at runtime
> {code:none}
> ERROR | DAGScheduler | dag-scheduler-event-loop | Failed to update 
> accumulators for task 0
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.AccumulatorV2$$anonfun$1.apply(AccumulatorV2.scala:106)
>   at 
> org.apache.spark.util.AccumulatorV2$$anonfun$1.apply(AccumulatorV2.scala:106)
>   at scala.Option.exists(Option.scala:240)
>   at org.apache.spark.util.AccumulatorV2.toInfo(AccumulatorV2.scala:106)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The issue is in wrapping name into _Some_ instead of _Option_ when creating 
> accumulators.
> Patch is available.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-25 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982587#comment-15982587
 ] 

Steven Rand commented on SPARK-7481:


What happened to https://github.com/apache/spark/pull/12004? It doesn't look 
like there were any concrete objections to the changes made there -- was it 
just closed for lack of a reviewer?

As someone who has spent several tens of hours (and counting!) debugging 
classpath issues for Spark applications that read from and write to an object 
store, I think this change is hugely valuable. I suspect that the large number 
of votes and watchers indicates that others think so as well, so it would be 
pretty depressing if it didn't happen just because no one will review the 
patch. Unfortunately I'm not qualified to review it myself, but I'd be quite 
grateful if someone more competent were to do so.

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982591#comment-15982591
 ] 

Sean Owen commented on SPARK-7481:
--

I don't believe my last round of comments was addressed, and it was one of 
quite a lot of rounds. This is a real problem.

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17403) Fatal Error: Scan cached strings

2017-04-25 Thread Paul Lysak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982595#comment-15982595
 ] 

Paul Lysak commented on SPARK-17403:


It looks like we have the same issue with Spark 2.1 on YARN (Amazon EMR release 
emr-5.4.0).
A workaround that solves the issue for us (at the cost of some performance) is to 
use df.persist(StorageLevel.DISK_ONLY) instead of df.cache(); a short sketch 
follows at the end of this message.

Depending on the node types, memory settings, storage level, and some other 
factors I couldn't clearly identify, it may appear as 

User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 158 in stage 504.0 failed 4 times, most recent failure: 
Lost task 158.3 in stage 504.0 (TID 427365, ip-10-35-162-171.ec2.internal, 
executor 83): java.lang.NegativeArraySizeException
at org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229)
at org.apache.spark.unsafe.types.UTF8String.clone(UTF8String.java:826)
at 
org.apache.spark.sql.execution.columnar.StringColumnStats.gatherStats(ColumnStats.scala:217)
at 
org.apache.spark.sql.execution.columnar.NullableColumnBuilder$class.appendFrom(NullableColumnBuilder.scala:55)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.org$apache$spark$sql$execution$columnar$compression$CompressibleColumnBuilder$$super$appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.appendFrom(CompressibleColumnBuilder.scala:78)
at 
org.apache.spark.sql.execution.columnar.NativeColumnBuilder.appendFrom(ColumnBuilder.scala:97)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:122)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)

or as 

User class threw exception: org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 27 in stage 61.0 failed 4 times, most recent failure: Lost 
task 27.3 in stage 61.0 (TID 36167, ip-10-35-162-149.ec2.internal, executor 1): 
java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_38$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
at 
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)

or as

2017-04-24 19:02:45,951 ERROR 
org.apache.spark.util.SparkUncaughtExceptionHandler: Uncaught exception in 
thread Thread[Executor task launch worker-3,5,main]
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_37$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:217)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anonfun$join$1.apply(HashJoin.scala:215)
at 
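
The DISK_ONLY workaround mentioned at the top of this comment looks roughly like this in PySpark; the input DataFrame below is a stand-in, not the reporter's data:

{code}
# Sketch of the workaround described above: persist to disk instead of caching in
# memory, trading some speed for avoiding the in-memory columnar cache path.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("disk-only-workaround").getOrCreate()

df = spark.range(10).selectExpr("cast(id as string) AS s")   # stand-in for the real DataFrame
df.persist(StorageLevel.DISK_ONLY)   # workaround: use this instead of df.cache()
df.count()                           # materialize the persisted data
{code}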

[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-25 Thread Armin Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982669#comment-15982669
 ] 

Armin Braun commented on SPARK-20336:
-

[~priancho] My bad in the above, apparently; I can't retrace the exact version I 
ran (maybe I mistakenly ran an old revision, sorry about that).
But I see the same behavior with `master` revision `31345fde82` from today.

{code}
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/04/25 12:14:55 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
17/04/25 12:14:57 WARN yarn.Client: Neither spark.yarn.jars nor 
spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://192.168.178.57:4040
Spark context available as 'sc' (master = yarn, app id = 
application_1493115274587_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
      /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.option("wholeFile", true).option("header", 
true).csv("file:///tmp/sample.csv").show()
+++-+
|col1|col2| col3|
+++-+
|   1|   a| text|
|   2|   b| テキスト|
|   3|   c|  텍스트|
|   4|   d|text
テキスト
텍스트|
|   5|   e| last|
+++-+



{code}



> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used spark.read.csv() method with wholeFile=True option to load data that 
> has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is same even if I use encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-04-25 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982683#comment-15982683
 ] 

Ismael Juma commented on SPARK-18057:
-

Thanks for the clarification [~zsxwing], that's helpful.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982695#comment-15982695
 ] 

Nick Pentreath commented on SPARK-13857:


I'm going to close this as superseded by SPARK-19535. 

However, the discussion here should still serve as a reference for making 
{{ALS.transform}} able to support ranking metrics for cross-validation in 
SPARK-14409.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-04-25 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath closed SPARK-13857.
--
Resolution: Duplicate

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cataly

2017-04-25 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17922:
--
Attachment: spark_17922.tar.gz

Repro case

> ClassCastException java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
> -
>
> Key: SPARK-17922
> URL: https://issues.apache.org/jira/browse/SPARK-17922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
> Attachments: spark_17922.tar.gz
>
>
> I am using spark 2.0
> Seeing class loading issue because the whole stage code gen is generating 
> multiple classes with same name as 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass"
> I am using dataframe transform. and within transform i use Osgi.
> Osgi replaces the thread context class loader to ContextFinder which looks at 
> all the class loaders in the stack to find out the new generated class and 
> finds the GeneratedClass with inner class GeneratedIterator byteclass 
> loader(instead of falling back to the byte class loader created by janino 
> compiler), since the class name is same that byte class loader loads the 
> class and returns GeneratedClass$GeneratedIterator instead of expected 
> GeneratedClass$UnsafeProjection.
> Can we generate different classes with different names or is it expected to 
> generate one class only? 
> This is the somewhat I am trying to do 
> {noformat} 
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import com.databricks.spark.avro._
>   def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
> //Initialize osgi
>  (rows:Iterator[Row]) => {
>  var outi = Iterator[Row]() 
>  while(rows.hasNext) {
>  val r = rows.next 
>  outi = outi.++(Iterator(Row(r.get(0  
>  } 
>  //val ors = Row("abc")   
>  //outi =outi.++( Iterator(ors))  
>  outi
>  }
>   }
> def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
>  (d:DataFrame) => {
>   val inType = d.schema
>   val rdd = d.rdd.mapPartitions(exePart(outType))
>   d.sqlContext.createDataFrame(rdd, outType)
> }
>
>   }
> val df = spark.read.avro("file:///data/builds/a1.avro")
> val df1 = df.select($"id2").filter(false)
> val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
> true)::Nil))).createOrReplaceTempView("tbl0")
> spark.sql("insert overwrite table testtable select p1 from tbl0")
> {noformat} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17922) ClassCastException java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator cannot be cast to org.apache.spark.sql.cata

2017-04-25 Thread kanika dhuria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982700#comment-15982700
 ] 

kanika dhuria commented on SPARK-17922:
---

Hi, I have attached a repro case for this issue. The archive includes a ReadMe 
with details of the required configuration steps.
Could somebody please use it to reproduce the issue and review the requested change?

> ClassCastException java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator 
> cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeProjection 
> -
>
> Key: SPARK-17922
> URL: https://issues.apache.org/jira/browse/SPARK-17922
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
> Attachments: spark_17922.tar.gz
>
>
> I am using spark 2.0
> Seeing class loading issue because the whole stage code gen is generating 
> multiple classes with same name as 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass"
> I am using dataframe transform. and within transform i use Osgi.
> Osgi replaces the thread context class loader to ContextFinder which looks at 
> all the class loaders in the stack to find out the new generated class and 
> finds the GeneratedClass with inner class GeneratedIterator byteclass 
> loader(instead of falling back to the byte class loader created by janino 
> compiler), since the class name is same that byte class loader loads the 
> class and returns GeneratedClass$GeneratedIterator instead of expected 
> GeneratedClass$UnsafeProjection.
> Can we generate different classes with different names or is it expected to 
> generate one class only? 
> This is the somewhat I am trying to do 
> {noformat} 
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import com.databricks.spark.avro._
>   def exePart(out:StructType): ((Iterator[Row]) => Iterator[Row]) = {
> //Initialize osgi
>  (rows:Iterator[Row]) => {
>  var outi = Iterator[Row]() 
>  while(rows.hasNext) {
>  val r = rows.next 
>  outi = outi.++(Iterator(Row(r.get(0  
>  } 
>  //val ors = Row("abc")   
>  //outi =outi.++( Iterator(ors))  
>  outi
>  }
>   }
> def transform1( outType:StructType) :((DataFrame) => DataFrame) = {
>  (d:DataFrame) => {
>   val inType = d.schema
>   val rdd = d.rdd.mapPartitions(exePart(outType))
>   d.sqlContext.createDataFrame(rdd, outType)
> }
>
>   }
> val df = spark.read.avro("file:///data/builds/a1.avro")
> val df1 = df.select($"id2").filter(false)
> val df2 = df1.transform(transform1(StructType(StructField("p1", IntegerType, 
> true)::Nil))).createOrReplaceTempView("tbl0")
> spark.sql("insert overwrite table testtable select p1 from tbl0")
> {noformat} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-25 Thread HanCheol Cho (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982702#comment-15982702
 ] 

HanCheol Cho commented on SPARK-20336:
--

Thank you for your additional test, [~original-brownbear].
I also ran the same test on our Hadoop cluster and the result was the same, as 
follows.
{code}
$ spark-shell --master yarn --deploy-mode client
Setting default log level to "WARN".
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/
...

scala> spark.read.option("wholeFile", true).option("header", 
true).csv("file:///tmp/test.encoding.csv").show()
++++
|col1|col2|col3|
++++
|   1|   a|text|
|   2|   b||
|   3|   c|   �|
|   4|   d|text
...|
|   5|   e|last|
++++
{code}
So, it seems to be a problem in our cluster; a small diagnostic sketch for ruling 
out one common cause follows below.

[~hyukjin.kwon] I will close this issue since it is not a problem in the Spark module.
Thank you for your help too.

Best wishes,
Han-Cheol


> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
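
Since the conclusion above is that the broken characters come from the environment rather than from Spark, a quick way to rule Spark out is to decode the file's raw bytes directly. A minimal sketch, assuming the sample file is at /tmp/test.encoding.csv and is expected to be UTF-8:

{code}
import java.nio.charset.{Charset, StandardCharsets}
import java.nio.file.{Files, Paths}

// Decode the file outside of Spark: if the Japanese/Korean text is already
// garbled here, the file (or the JVM's default charset on the cluster nodes)
// is the culprit, not the CSV reader.
val bytes = Files.readAllBytes(Paths.get("/tmp/test.encoding.csv"))
println("default charset: " + Charset.defaultCharset())
println(new String(bytes, StandardCharsets.UTF_8))
{code}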



[jira] [Closed] (SPARK-20336) spark.read.csv() with wholeFile=True option fails to read non ASCII unicode characters

2017-04-25 Thread HanCheol Cho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

HanCheol Cho closed SPARK-20336.

Resolution: Not A Bug

> spark.read.csv() with wholeFile=True option fails to read non ASCII unicode 
> characters
> --
>
> Key: SPARK-20336
> URL: https://issues.apache.org/jira/browse/SPARK-20336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0 (master branch is downloaded from Github)
> PySpark
>Reporter: HanCheol Cho
>
> I used the spark.read.csv() method with the wholeFile=True option to load data 
> that has multi-line records.
> However, non-ASCII characters are not properly loaded.
> The following is a sample data for test:
> {code:none}
> col1,col2,col3
> 1,a,text
> 2,b,テキスト
> 3,c,텍스트
> 4,d,"text
> テキスト
> 텍스트"
> 5,e,last
> {code}
> When it is loaded without the wholeFile=True option, non-ASCII characters are 
> shown correctly although multi-line records are parsed incorrectly as follows:
> {code:none}
> testdf_default = spark.read.csv("test.encoding.csv", header=True)
> testdf_default.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b|テキスト|
> |   3|   c| 텍스트|
> |   4|   d|text|
> |テキスト|null|null|
> | 텍스트"|null|null|
> |   5|   e|last|
> ++++
> {code}
> When the wholeFile=True option is used, non-ASCII characters are broken as 
> follows:
> {code:none}
> testdf_wholefile = spark.read.csv("test.encoding.csv", header=True, 
> wholeFile=True)
> testdf_wholefile.show()
> ++++
> |col1|col2|col3|
> ++++
> |   1|   a|text|
> |   2|   b||
> |   3|   c|   �|
> |   4|   d|text
> ...|
> |   5|   e|last|
> ++++
> {code}
> The result is the same even if I use the encoding="UTF-8" option with wholeFile=True.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20457) Spark CSV is not able to Override Schema while reading data

2017-04-25 Thread Himanshu Gupta (JIRA)
Himanshu Gupta created SPARK-20457:
--

 Summary: Spark CSV is not able to Override Schema while reading 
data
 Key: SPARK-20457
 URL: https://issues.apache.org/jira/browse/SPARK-20457
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Himanshu Gupta


I have a CSV file, test.csv:

{code:csv}
col
1
2
3
4
{code}

When I read it using Spark, it gets the schema of data correct:

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"true").csv("test.csv")

df.printSchema
root
 |-- col: integer (nullable = true)
{code}

But when I override the `schema` of the CSV file and set `inferSchema` to false, 
SparkSession picks up the custom schema only partially.

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"false").schema(StructType(List(StructField("custom", StringType, 
false)))).csv("test.csv")

df.printSchema
root
 |-- custom: string (nullable = true)
{code}

I mean only the column name (`custom`) and DataType (`StringType`) are getting 
picked up. The `nullable` part is being ignored, as it still comes back as 
`nullable = true`, which is incorrect.

I am not able to understand this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
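
A minimal sketch for making the reported behaviour visible, assuming the same test.csv as above; it only compares the nullability that was requested with what the reader reports back, without asserting which of the two is correct:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("spark-20457-sketch").getOrCreate()

val userSchema = StructType(List(StructField("custom", StringType, nullable = false)))

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "false")
  .schema(userSchema)
  .csv("test.csv")

// Compare the requested nullability with what the CSV reader actually reports.
userSchema.fields.zip(df.schema.fields).foreach { case (asked, got) =>
  println(s"${asked.name}: requested nullable=${asked.nullable}, reported nullable=${got.nullable}")
}
{code}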



[jira] [Updated] (SPARK-20457) Spark CSV is not able to Override Schema while reading data

2017-04-25 Thread Himanshu Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Himanshu Gupta updated SPARK-20457:
---
Description: 
I have a CSV file, test.csv:

{code:xml}
col
1
2
3
4
{code}

When I read it using Spark, it gets the schema of data correct:

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"true").csv("test.csv")

df.printSchema
root
|-- col: integer (nullable = true)
{code}

But when I override the `schema` of the CSV file and set `inferSchema` to false, 
SparkSession picks up the custom schema only partially.

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"false").schema(StructType(List(StructField("custom", StringType, 
false)))).csv("test.csv")

df.printSchema
root
|-- custom: string (nullable = true)
{code}

I mean only the column name (`custom`) and DataType (`StringType`) are getting 
picked up. The `nullable` part is being ignored, as it still comes back as 
`nullable = true`, which is incorrect.

I am not able to understand this behavior.

  was:
I have a CSV file, test.csv:

{code:csv}
col
1
2
3
4
{code}

When I read it using Spark, it gets the schema of data correct:

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"true").csv("test.csv")

df.printSchema
root
 |-- col: integer (nullable = true)
{code}

But when I override the `schema` of the CSV file and set `inferSchema` to false, 
SparkSession picks up the custom schema only partially.

{code:java}
val df = spark.read.option("header", "true").option("inferSchema", 
"false").schema(StructType(List(StructField("custom", StringType, 
false)))).csv("test.csv")

df.printSchema
root
 |-- custom: string (nullable = true)
{code}

I mean only the column name (`custom`) and DataType (`StringType`) are getting 
picked up. The `nullable` part is being ignored, as it still comes back as 
`nullable = true`, which is incorrect.

I am not able to understand this behavior.


> Spark CSV is not able to Override Schema while reading data
> ---
>
> Key: SPARK-20457
> URL: https://issues.apache.org/jira/browse/SPARK-20457
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Himanshu Gupta
>
> I have a CSV file, test.csv:
> {code:xml}
> col
> 1
> 2
> 3
> 4
> {code}
> When I read it using Spark, it gets the schema of data correct:
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "true").csv("test.csv")
> 
> df.printSchema
> root
> |-- col: integer (nullable = true)
> {code}
> But when I override the `schema` of the CSV file and set `inferSchema` to 
> false, SparkSession picks up the custom schema only partially.
> {code:java}
> val df = spark.read.option("header", "true").option("inferSchema", 
> "false").schema(StructType(List(StructField("custom", StringType, 
> false)))).csv("test.csv")
> df.printSchema
> root
> |-- custom: string (nullable = true)
> {code}
> I mean only the column name (`custom`) and DataType (`StringType`) are getting 
> picked up. The `nullable` part is being ignored, as it still comes back as 
> `nullable = true`, which is incorrect.
> I am not able to understand this behavior.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread surya pratap (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982830#comment-15982830
 ] 

surya pratap commented on SPARK-20445:
--

Hello Hyukjin Kwon,
Thanks for the reply.
I tried many times but am still getting the same issue.
I am using Spark version 1.6.1.
Please help me sort out this issue.

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model = 

[jira] [Commented] (SPARK-20445) pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See Stri

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982852#comment-15982852
 ] 

Hyukjin Kwon commented on SPARK-20445:
--

Are you maybe able to try this against the current master or higher versions?

> pyspark.sql.utils.IllegalArgumentException: u'DecisionTreeClassifier was 
> given input with invalid label column label, without the number of classes 
> specified. See StringIndexer
> 
>
> Key: SPARK-20445
> URL: https://issues.apache.org/jira/browse/SPARK-20445
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: surya pratap
>
>  #Load the CSV file into a RDD
> irisData = sc.textFile("/home/infademo/surya/iris.csv")
> irisData.cache()
> irisData.count()
> #Remove the first line (contains headers)
> dataLines = irisData.filter(lambda x: "Sepal" not in x)
> dataLines.count()
> from pyspark.sql import Row
> #Create a Data Frame from the data
> parts = dataLines.map(lambda l: l.split(","))
> irisMap = parts.map(lambda p: Row(SEPAL_LENGTH=float(p[0]),\
> SEPAL_WIDTH=float(p[1]), \
> PETAL_LENGTH=float(p[2]), \
> PETAL_WIDTH=float(p[3]), \
> SPECIES=p[4] ))
> # Infer the schema, and register the DataFrame as a table.
> irisDf = sqlContext.createDataFrame(irisMap)
> irisDf.cache()
> #Add a numeric indexer for the label/target column
> from pyspark.ml.feature import StringIndexer
> stringIndexer = StringIndexer(inputCol="SPECIES", outputCol="IND_SPECIES")
> si_model = stringIndexer.fit(irisDf)
> irisNormDf = si_model.transform(irisDf)
> irisNormDf.select("SPECIES","IND_SPECIES").distinct().collect()
> irisNormDf.cache()
> 
> """--
> Perform Data Analytics
> 
> -"""
> #See standard parameters
> irisNormDf.describe().show()
> #Find correlation between predictors and target
> for i in irisNormDf.columns:
> if not( isinstance(irisNormDf.select(i).take(1)[0][0], basestring)) :
> print( "Correlation to Species for ", i, \
> irisNormDf.stat.corr('IND_SPECIES',i))
> #Transform to a Data Frame for input to Machine Learing
> #Drop columns that are not required (low correlation)
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.util import MLUtils
> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
> from pyspark.mllib.linalg.distributed import RowMatrix
> from pyspark.ml.linalg import Vectors
> pyspark.mllib.linalg.Vector
> def transformToLabeledPoint(row) :
> lp = ( row["SPECIES"], row["IND_SPECIES"], \
> Vectors.dense([row["SEPAL_LENGTH"],\
> row["SEPAL_WIDTH"], \
> row["PETAL_LENGTH"], \
> row["PETAL_WIDTH"]]))
> return lp
> irisLp = irisNormDf.rdd.map(transformToLabeledPoint)
> irisLpDf = sqlContext.createDataFrame(irisLp,["species","label", 
> "features"])
> irisLpDf.select("species","label","features").show(10)
> irisLpDf.cache()
> 
> """--
> Perform Machine Learning
> 
> -"""
> #Split into training and testing data
> (trainingData, testData) = irisLpDf.randomSplit([0.9, 0.1])
> trainingData.count()
> testData.count()
> testData.collect()
> from pyspark.ml.classification import DecisionTreeClassifier
> from pyspark.ml.evaluation import MulticlassClassificationEvaluator
> #Create the model
> dtClassifer = DecisionTreeClassifier(maxDepth=2, labelCol="label",\
> featuresCol="features")
>dtModel = dtClassifer.fit(trainingData)
>
>issue part:-
>
>dtModel = dtClassifer.fit(trainingData) Traceback (most recent call last): 
> File "", line 1, in File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/pipeline.py", 
> line 69, in fit return self._fit(dataset) File 
> "/opt/mapr/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/ml/wrapper.py", 
> line 133, in _fit java_model = self._fit_java(dataset) File 
> 
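
The exception above complains that the label column carries no class-count metadata. StringIndexer attaches that metadata to its output column, but it is likely lost in the snippet because the indexed value is copied into a hand-built DataFrame via createDataFrame. A minimal Scala sketch of the pattern the error message points to, feeding the indexer's output column straight into the classifier (column names and the local iris.csv path are illustrative; written against the Spark 2.x ML API):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("spark-20445-sketch").getOrCreate()

// Illustrative read; columns assumed: SEPAL_LENGTH, SEPAL_WIDTH, PETAL_LENGTH, PETAL_WIDTH, SPECIES.
val iris = spark.read.option("header", "true").option("inferSchema", "true").csv("iris.csv")

val indexer = new StringIndexer()
  .setInputCol("SPECIES")
  .setOutputCol("IND_SPECIES")      // output column carries the number-of-classes metadata

val assembler = new VectorAssembler()
  .setInputCols(Array("SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH"))
  .setOutputCol("features")

val dt = new DecisionTreeClassifier()
  .setMaxDepth(2)
  .setLabelCol("IND_SPECIES")       // use the indexed column directly, not a metadata-free copy
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(indexer, assembler, dt)).fit(iris)
model.transform(iris).select("SPECIES", "IND_SPECIES", "prediction").show(5)
{code}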

[jira] [Created] (SPARK-20458) support getting Yarn Tracking URL in code

2017-04-25 Thread PJ Fanning (JIRA)
PJ Fanning created SPARK-20458:
--

 Summary: support getting Yarn Tracking URL in code
 Key: SPARK-20458
 URL: https://issues.apache.org/jira/browse/SPARK-20458
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.1.0
Reporter: PJ Fanning


org.apache.spark.deploy.yarn.Client logs the Yarn tracking URL but it would be 
useful to be able to access this in code, as opposed to mining log output.

I have an application where I monitor the health of the SparkContext and 
associated Executors using the Spark REST API.

Would it be feasible to add a listener API to listen for new ApplicationReports 
in org.apache.spark.deploy.yarn.Client? Alternatively, this URL could be 
exposed as a property associated with the SparkContext.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
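
For illustration only, one possible shape for the listener idea in this request; none of these names exist in Spark today, and `YarnAppReportListener` and `onApplicationReport` are made up for this sketch:

{code}
import org.apache.hadoop.yarn.api.records.ApplicationReport

// Hypothetical hook (not an existing Spark API): the YARN Client would invoke it
// each time it polls a new ApplicationReport, exposing the tracking URL in code.
trait YarnAppReportListener {
  def onApplicationReport(report: ApplicationReport): Unit
}

// Example consumer that simply captures the latest tracking URL.
class TrackingUrlCapture extends YarnAppReportListener {
  @volatile var trackingUrl: Option[String] = None
  override def onApplicationReport(report: ApplicationReport): Unit =
    trackingUrl = Option(report.getTrackingUrl)
}
{code}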



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-04-25 Thread Dmitry Naumenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982949#comment-15982949
 ] 

Dmitry Naumenko commented on SPARK-13747:
-

[~zsxwing] I did a similar test with join and have the same error in 2.2.0 
(actual query here - 
https://github.com/dnaumenko/spark-realtime-analytics-sample/blob/master/samples/src/main/scala/com/github/sparksample/httpapp/SimpleServer.scala).
 

My test setup is a simple long-running akka-http application and a separate 
Gatling script that spawns multiple requests for the join query 
(https://github.com/dnaumenko/spark-realtime-analytics-sample/blob/master/loadtool/src/main/scala/com/github/sparksample/SimpleServerSimulation.scala
 is the test simulation script).

[~barrybecker4] I've managed to fix the problem by switching Akka's default 
executor to a thread pool. But I guess the root cause is that Spark relies on 
ThreadLocal variables and manages them incorrectly.

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global) as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, ForkJoinPool will run another task 
> in the same thread; however, the local properties have been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
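
A minimal sketch of the workaround mentioned in the comment above: run the concurrent queries on a dedicated fixed thread pool instead of a ForkJoinPool, so a blocked thread never picks up another task and the thread-local spark.sql.execution.id is not clobbered (assuming a SparkSession named spark; the pool size is arbitrary):

{code}
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// Fixed thread pool: unlike a ForkJoinPool, a thread waiting on a Spark job
// does not run another queued task in the meantime.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

import spark.implicits._

val jobs = (1 to 100).map { _ =>
  Future {
    spark.sparkContext.parallelize(1 to 5).map(i => (i, i)).toDF("a", "b").count()
  }
}

println(Await.result(Future.sequence(jobs), Duration.Inf).sum)
{code}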



[jira] [Commented] (SPARK-20446) Optimize the process of MLLIB ALS recommendForAll

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983018#comment-15983018
 ] 

Nick Pentreath commented on SPARK-20446:


By the way, when I say it is a duplicate, I mean the JIRA ticket. I agree 
that PR 9980 was not the correct solution; JIRA tickets can have multiple PRs 
linked to them.

I'd prefer to close this ticket and move the discussion to SPARK-11968 (there 
are also watchers on that ticket who may be interested in the outcome).



> Optimize the process of MLLIB ALS recommendForAll
> -
>
> Key: SPARK-20446
> URL: https://issues.apache.org/jira/browse/SPARK-20446
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>
> The recommendForAll of MLLIB ALS is very slow.
> GC is a key problem of the current method. 
> The task uses the following code to keep the temp result:
> val output = new Array[(Int, (Int, Double))](m*n)
> m = n = 4096 (default value, no method to set)
> so output is about 4k * 4k * (4 + 4 + 8) = 256 MB. This is a lot of memory; it 
> causes serious GC problems and frequently OOMs.
> Actually, we don't need to save all the temp results. Suppose we recommend the 
> topK (topK is about 10 or 20) products for each user; then we only need 4k * topK 
> * (4 + 4 + 8) memory to save the temp result.
> I have written a solution for this method with the following test result. 
> The Test Environment:
> 3 workers: each worker has 10 cores, 30 GB memory, and 1 executor.
> The data: 480,000 users and 17,000 items.
> BlockSize: 1024 2048 4096 8192
> Old method: 245s 332s 488s OOM
> This solution: 121s 118s 117s 120s
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
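
A minimal sketch of the bounded-buffer idea described above, keeping only the top K scores per user while scanning a block of candidates instead of materialising all m * n pairs (K and the candidate count are illustrative; this is not the actual patch):

{code}
import scala.collection.mutable
import scala.util.Random

// Keep at most k (item, score) pairs; ordering by negated score puts the
// current weakest candidate at the head, so it is the one evicted first.
def topK(k: Int, scores: Iterator[(Int, Double)]): Array[(Int, Double)] = {
  val weakestFirst = Ordering.by[(Int, Double), Double](-_._2)
  val queue = mutable.PriorityQueue.empty[(Int, Double)](weakestFirst)
  scores.foreach { candidate =>
    if (queue.size < k) queue.enqueue(candidate)
    else if (candidate._2 > queue.head._2) { queue.dequeue(); queue.enqueue(candidate) }
  }
  queue.dequeueAll.reverse.toArray  // strongest score first
}

// Example: 4096 candidate items for one user, keep only the best 10,
// so the buffer never grows beyond k entries.
val candidates = Iterator.tabulate(4096)(i => (i, Random.nextDouble()))
println(topK(10, candidates).take(3).mkString(", "))
{code}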



[jira] [Commented] (SPARK-11968) ALS recommend all methods spend most of time in GC

2017-04-25 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983026#comment-15983026
 ] 

Nick Pentreath commented on SPARK-11968:


Note, there is a solution proposed in SPARK-20446. I've redirected the 
discussion to this original JIRA.

> ALS recommend all methods spend most of time in GC
> --
>
> Key: SPARK-11968
> URL: https://issues.apache.org/jira/browse/SPARK-11968
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> After adding recommendUsersForProducts and recommendProductsForUsers to ALS 
> in spark-perf, I noticed that it takes much longer than ALS itself.  Looking 
> at the monitoring page, I can see it is spending about 8min doing GC for each 
> 10min task.  That sounds fixable.  Looking at the implementation, there is 
> clearly an opportunity to avoid extra allocations: 
> [https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]
> CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


