[jira] [Commented] (SPARK-15678) Not use cache on appends and overwrites

2017-02-21 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877618#comment-15877618
 ] 

Fei Shao commented on SPARK-15678:
--

I checked the code with refreshByPath, and it worked well.
But I do not think it is a reasonable solution. Can we make it work without 
refreshByPath, please? Or why should we call refreshByPath first? Is there any 
special reason?
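
For reference, here is a minimal sketch of the workaround under discussion, reusing 
the /tmp/test path from the issue description; the extra refreshByPath call is 
exactly the part being questioned:

{code}
// Sketch only: the refreshByPath call is the workaround under discussion.
val dir = "/tmp/test"
spark.range(1000).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir).cache()
df.count()                        // 1000

spark.range(10).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)  // drops cached data/metadata for this path
spark.read.parquet(dir).count()   // returns 10 after the refresh
{code}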

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 < We 
> are still using the cached dataset
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-21 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877587#comment-15877587
 ] 

Saisai Shao commented on SPARK-19688:
-

Can you please elaborate on the problem you met? Otherwise it is hard for others 
to identify.

Also, "spark.yarn.credentials.file" is an internal configuration; usually users 
should not configure it.

> Spark on Yarn Credentials File set to different application directory
> -
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.3
>Reporter: Devaraj Jonnadula
>Priority: Minor
>
> The spark.yarn.credentials.file property is set to a different application ID 
> instead of the actual application ID.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19617) Fix the race condition when starting and stopping a query quickly

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877578#comment-15877578
 ] 

Apache Spark commented on SPARK-19617:
--

User 'gf53520' has created a pull request for this issue:
https://github.com/apache/spark/pull/16980

> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> The streaming thread in StreamExecution uses the following ways to check if 
> it should exit:
> - Catch an InterruptedException.
> - `StreamExecution.state` is TERMINATED.
> When starting and stopping a query quickly, the above two checks may both 
> fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and 
> swallow InterruptedException.
> - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then 
> [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252]
>  changes the state from `TERMINATED` to `ACTIVE`.
> If both of the above happen, the query will hang forever.
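
For readers following the race, a simplified sketch (names and structure are 
illustrative, not the actual StreamExecution code) of how a stop() that runs 
before the state flips to ACTIVE can be silently undone:

{code}
// Illustrative only; not the real StreamExecution state machine.
import java.util.concurrent.atomic.AtomicReference

object DemoState extends Enumeration { val INITIALIZING, ACTIVE, TERMINATED = Value }
val state = new AtomicReference(DemoState.INITIALIZING)

def stop(): Unit = state.set(DemoState.TERMINATED)  // may run first, from another thread

def runBatches(): Unit = {
  state.set(DemoState.ACTIVE)  // overwrites TERMINATED, so the stop() above is lost
  // while (state.get == DemoState.ACTIVE) { ... process a batch ... }
}
{code}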



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19639) Add spark.svmLinear example and update vignettes

2017-02-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877576#comment-15877576
 ] 

Felix Cheung commented on SPARK-19639:
--

Right, I left it out because of the programming guide, which was fixed in 
another PR.

> Add spark.svmLinear example and update vignettes
> 
>
> Key: SPARK-19639
> URL: https://issues.apache.org/jira/browse/SPARK-19639
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Miao Wang
> Fix For: 2.2.0
>
>
> We recently added the spark.svmLinear API for SparkR. We need to add an example 
> and update the vignettes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15678) Not use cache on appends and overwrites

2017-02-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877575#comment-15877575
 ] 

Kazuaki Ishizaki commented on SPARK-15678:
--

How about inserting {{spark.catalog.refreshByPath()}} as follows?

{code}
spark.range(1000).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)  // insert a NEW statement
val df1 = spark.read.parquet(dir)
df1.count
f(df1).count
{code}

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 < We 
> are still using the cached dataset
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19639) Add spark.svmLinear example and update vignettes

2017-02-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19639.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> Add spark.svmLinear example and update vignettes
> 
>
> Key: SPARK-19639
> URL: https://issues.apache.org/jira/browse/SPARK-19639
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.2.0
>
>
> We recently added the spark.svmLinear API for SparkR. We need to add an example 
> and update the vignettes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15678) Not use cache on appends and overwrites

2017-02-21 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877575#comment-15877575
 ] 

Kazuaki Ishizaki edited comment on SPARK-15678 at 2/22/17 6:37 AM:
---

How about inserting {{spark.catalog.refreshByPath()}} as follows?

{code}
spark.range(1000).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)  // insert a NEW statement
val df1 = spark.read.parquet(dir)
df1.count
f(df1).count
{code}


was (Author: kiszk):
How about insert {{spark.catalog.refreshByPath()}} as follows?

{code}
spark.range(1000).write.mode("overwrite").parquet(dir)
spark.catalog.refreshByPath(dir)  // insert a NEW statement
val df1 = spark.read.parquet(dir)
df1.count
f(df1).count
{code}

> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 < We 
> are still using the cached dataset
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19642) Improve the security guarantee for rest api and ui

2017-02-21 Thread Genmao Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Genmao Yu closed SPARK-19642.
-
Resolution: Won't Fix

> Improve the security guarantee for rest api and ui
> --
>
> Key: SPARK-19642
> URL: https://issues.apache.org/jira/browse/SPARK-19642
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>
> As Spark gets more and more features, data may start leaking through other 
> places (e.g. SQL query plans which are shown in the UI). Also, the current REST 
> API may be a security hole. This JIRA is opened to research and address the 
> potential security flaws.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15678) Not use cache on appends and overwrites

2017-02-21 Thread Gen TANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877513#comment-15877513
 ] 

Gen TANG edited comment on SPARK-15678 at 2/22/17 5:41 AM:
---

Hi, I found a bug which is probably related to this issue. [~sameerag]
Please consider the following code.

{code}
import org.apache.spark.sql.DataFrame

def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache
  df.count
  df
}

f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is correct
f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which is correct

val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count // output 100 which is correct
f(df).count // output 89 which is correct

spark.range(1000).write.mode("overwrite").parquet(dir)
val df1 = spark.read.parquet(dir)
df1.count // output 1000 which is correct; in fact, operations other than df1.filter("id>10") return correct results.
f(df1).count // output 89 which is incorrect
{code}

In fact, when we use df1.filter("id>10"), Spark will use the old cached DataFrame.


was (Author: gen):
Hi, I found a bug which is probably related to this issue. [~sameerag]
Please consider the following code.

{code}
import org.apache.spark.sql.DataFrame

def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache
  df.count
  df
}

f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is correct
f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which is correct

val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count // output 100 which is correct
f(df).count // output 89 which is correct

spark.range(1000).write.mode("overwrite").parquet(dir)
val df1 = spark.read.parquet(dir)
df1.count // output 1000 which is correct; in fact, operations other than df1.filter("id>10") return correct results.
f(df1).count // output 89 which is incorrect
{code}


> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 < We 
> are still using the cached dataset
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15678) Not use cache on appends and overwrites

2017-02-21 Thread Gen TANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877513#comment-15877513
 ] 

Gen TANG commented on SPARK-15678:
--

Hi, I found a bug which is probably related to this issue. [~sameerag]
Please consider the following code.

{code}
import org.apache.spark.sql.DataFrame

def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache
  df.count
  df
}

f(spark.range(100).asInstanceOf[DataFrame]).count // output 89 which is correct
f(spark.range(1000).asInstanceOf[DataFrame]).count // output 989 which is correct

val dir = "/tmp/test"
spark.range(100).write.mode("overwrite").parquet(dir)
val df = spark.read.parquet(dir)
df.count // output 100 which is correct
f(df).count // output 89 which is correct

spark.range(1000).write.mode("overwrite").parquet(dir)
val df1 = spark.read.parquet(dir)
df1.count // output 1000 which is correct; in fact, operations other than df1.filter("id>10") return correct results.
f(df1).count // output 89 which is incorrect
{code}


> Not use cache on appends and overwrites
> ---
>
> Key: SPARK-15678
> URL: https://issues.apache.org/jira/browse/SPARK-15678
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
> Fix For: 2.0.0
>
>
> SparkSQL currently doesn't drop caches if the underlying data is overwritten.
> {code}
> val dir = "/tmp/test"
> sqlContext.range(1000).write.mode("overwrite").parquet(dir)
> val df = sqlContext.read.parquet(dir).cache()
> df.count() // outputs 1000
> sqlContext.range(10).write.mode("overwrite").parquet(dir)
> sqlContext.read.parquet(dir).count() // outputs 1000 instead of 10 < We 
> are still using the cached dataset
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877501#comment-15877501
 ] 

Apache Spark commented on SPARK-19525:
--

User 'aramesh117' has created a pull request for this issue:
https://github.com/apache/spark/pull/17024

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.
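
As a rough illustration of the idea only (not the actual patch), the checkpoint 
bytes can be wrapped in a snappy stream before being written out; the snappy-java 
classes below ship with Spark, but the path and payload are placeholders:

{code}
// Illustration only: wrap a checkpoint partition stream in a snappy codec.
import java.io.{BufferedOutputStream, FileOutputStream}
import org.xerial.snappy.SnappyOutputStream

val partitionBytes: Array[Byte] = Array.fill(1024)(0.toByte)  // stand-in for a serialized partition

val out = new SnappyOutputStream(
  new BufferedOutputStream(new FileOutputStream("/tmp/checkpoint-part-00000.snappy")))
try out.write(partitionBytes) finally out.close()
{code}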



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19525:


Assignee: Apache Spark

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>Assignee: Apache Spark
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19525) Enable Compression of RDD Checkpoints

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19525:


Assignee: (was: Apache Spark)

> Enable Compression of RDD Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19696) Wrong Documentation for Java Word Count Example

2017-02-21 Thread gaurav gupta (JIRA)
gaurav gupta created SPARK-19696:


 Summary: Wrong Documentation for Java Word Count Example
 Key: SPARK-19696
 URL: https://issues.apache.org/jira/browse/SPARK-19696
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.0
Reporter: gaurav gupta


The Java Word Count example on the http://spark.apache.org/examples.html page is 
defined incorrectly:

'''
JavaRDD textFile = sc.textFile("hdfs://...");
JavaRDD words = textFile.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");
'''

It should be 

'''
JavaRDD textFile = sc.textFile("hdfs://...");
JavaPairRDD counts = textFile.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
.mapToPair(word -> new Tuple2<>(word, 1))
.reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");
'''




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in Json formats

2017-02-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-19695:
-
Summary: Throw an exception if a `columnNameOfCorruptRecord` field violates 
requirements in Json formats  (was: Throw an exception if a 
`columnNameOfCorruptRecord` field violates requirements)

> Throw an exception if a `columnNameOfCorruptRecord` field violates 
> requirements in Json formats
> ---
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and fixes 
> a json behaviour along with the CSV one. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19661) Spark-2.1.0 can not connect hbase

2017-02-21 Thread sydt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877441#comment-15877441
 ] 

sydt edited comment on SPARK-19661 at 2/22/17 4:18 AM:
---

Yeah, you are right. It is actually accomplished by hive-hbase-handler-0.13.1.jar. 
In Hive, we can create the table with:
SET hbase.zookeeper.quorum=zkNode1,zkNode2,zkNode3; 
SET zookeeper.znode.parent=/hbase;
ADD jar /usr/local/apache-hive-0.13.1-bin/lib/hive-hbase-handler-0.13.1.jar;
CREATE EXTERNAL TABLE lxw1234 (
rowkey string,
f1 map,
f2 map,
f3 map
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")
TBLPROPERTIES ("hbase.table.name" = "lxw1234");
After  
INSERT INTO TABLE lxw1234 
SELECT 'row1' AS rowkey,
map('c3','name3') AS f1,
map('c3','age3') AS f2,
map('c4','job3') AS f3 
FROM DUAL 
limit 1;
now Hive can access table lxw1234 like an ordinary external table, and it is 
stored in HBase. Spark SQL has the same syntax parser as Hive, and it supports 
this SQL syntax in spark-1.6.2; however, in spark-2.1.0 it does not.
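
For completeness, a sketch of how the same DDL hits the error from Spark (the map 
value types were stripped in this thread, so the ones below are assumed):

{code}
// Sketch only; column types are assumed, since they were stripped in this thread.
spark.sql(
  """CREATE EXTERNAL TABLE lxw1234 (rowkey string, f1 map<string,string>)
    |STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    |WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:")
    |TBLPROPERTIES ("hbase.table.name" = "lxw1234")""".stripMargin)
// In spark-2.1.0 this fails with the "Operation not allowed: STORED BY" error quoted below.
{code}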


was (Author: wangchao2017):
Yeah, you are right. It is actually accomplished by hive-hbase-handler-0.13.1.jar. 
In Hive, we can create the table with:
SET hbase.zookeeper.quorum=zkNode1,zkNode2,zkNode3; 
SET zookeeper.znode.parent=/hbase;
ADD jar /usr/local/apache-hive-0.13.1-bin/lib/hive-hbase-handler-0.13.1.jar;
CREATE EXTERNAL TABLE lxw1234 (
rowkey string,
f1 map,
f2 map,
f3 map
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")
TBLPROPERTIES ("hbase.table.name" = "lxw1234");
After  
INSERT INTO TABLE lxw1234 
SELECT 'row1' AS rowkey,
map('c3','name3') AS f1,
map('c3','age3') AS f2,
map('c4','job3') AS f3 
FROM DUAL 
limit 1;
now Hive can access table lxw1234 like an ordinary external table, and it is 
stored in HBase.

> Spark-2.1.0 can not connect hbase
> -
>
> Key: SPARK-19661
> URL: https://issues.apache.org/jira/browse/SPARK-19661
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: sydt
>
> When spark-sql of spark-2.1.0 connects to HBase by
> CREATE EXTERNAL TABLE lxw123(  
> rowkey string,  
> f1 map,  
> f2 map,  
> f3 map  
> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")  
> TBLPROPERTIES ("hbase.table.name" = "lxw1234");
> it has no response and shows:
> Error in query: 
> Operation not allowed: STORED BY(line 6, pos 2)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19617) Fix the race condition when starting and stopping a query quickly

2017-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19617.
--
   Resolution: Fixed
Fix Version/s: 2.1.1

> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> The streaming thread in StreamExecution uses the following ways to check if 
> it should exit:
> - Catch an InterruptedException.
> - `StreamExecution.state` is TERMINATED.
> When starting and stopping a query quickly, the above two checks may both 
> fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084] and 
> swallow InterruptedException.
> - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then 
> [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252]
>  changes the state from `TERMINATED` to `ACTIVE`.
> If both of the above happen, the query will hang forever.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19661) Spark-2.1.0 can not connect hbase

2017-02-21 Thread sydt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877441#comment-15877441
 ] 

sydt commented on SPARK-19661:
--

Yeah, you are right. It is actually accomplished by hive-hbase-handler-0.13.1.jar. 
In Hive, we can create the table with:
SET hbase.zookeeper.quorum=zkNode1,zkNode2,zkNode3; 
SET zookeeper.znode.parent=/hbase;
ADD jar /usr/local/apache-hive-0.13.1-bin/lib/hive-hbase-handler-0.13.1.jar;
CREATE EXTERNAL TABLE lxw1234 (
rowkey string,
f1 map,
f2 map,
f3 map
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")
TBLPROPERTIES ("hbase.table.name" = "lxw1234");
After  
INSERT INTO TABLE lxw1234 
SELECT 'row1' AS rowkey,
map('c3','name3') AS f1,
map('c3','age3') AS f2,
map('c4','job3') AS f3 
FROM DUAL 
limit 1;
now Hive can access table lxw1234 like an ordinary external table, and it is 
stored in HBase.

> Spark-2.1.0 can not connect hbase
> -
>
> Key: SPARK-19661
> URL: https://issues.apache.org/jira/browse/SPARK-19661
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: sydt
>
> When spark-sql of spark-2.1.0 connects to HBase by
> CREATE EXTERNAL TABLE lxw123(  
> rowkey string,  
> f1 map,  
> f2 map,  
> f3 map  
> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")  
> TBLPROPERTIES ("hbase.table.name" = "lxw1234");
> it has no response and shows:
> Error in query: 
> Operation not allowed: STORED BY(line 6, pos 2)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19695:


Assignee: (was: Apache Spark)

> Throw an exception if a `columnNameOfCorruptRecord` field violates 
> requirements
> ---
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and fixes 
> a json behaviour along with the CSV one. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19695:


Assignee: Apache Spark

> Throw an exception if a `columnNameOfCorruptRecord` field violates 
> requirements
> ---
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and fixes 
> a json behaviour along with the CSV one. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877440#comment-15877440
 ] 

Apache Spark commented on SPARK-19695:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/17023

> Throw an exception if a `columnNameOfCorruptRecord` field violates 
> requirements
> ---
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and fixes 
> a json behaviour along with the CSV one. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements

2017-02-21 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-19695:


 Summary: Throw an exception if a `columnNameOfCorruptRecord` field 
violates requirements
 Key: SPARK-19695
 URL: https://issues.apache.org/jira/browse/SPARK-19695
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Takeshi Yamamuro
Priority: Minor


This ticket comes from https://github.com/apache/spark/pull/16928, and fixes a 
json behaviour along with the CSV one. 
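
For context, a minimal sketch of the requirement being enforced: when a 
user-supplied schema is used, the field named by {{columnNameOfCorruptRecord}} has 
to be string-typed (the column name and path below are illustrative):

{code}
// Sketch only: the corrupt-record column must be declared as StringType.
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("a", IntegerType)
  .add("_corrupt_record", StringType)  // declaring any other type should now raise an exception

val df = spark.read
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/tmp/input.json")
{code}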



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19694:


Assignee: Apache Spark

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> {{LDAModel}} can not set the output column now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877417#comment-15877417
 ] 

Apache Spark commented on SPARK-19694:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17021

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> {{LDAModel}} can not set the output column now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19694:


Assignee: (was: Apache Spark)

> Add missing 'setTopicDistributionCol' for LDAModel
> --
>
> Key: SPARK-19694
> URL: https://issues.apache.org/jira/browse/SPARK-19694
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Priority: Trivial
>
> {{LDAModel}} can not set the output column now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel

2017-02-21 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-19694:


 Summary: Add missing 'setTopicDistributionCol' for LDAModel
 Key: SPARK-19694
 URL: https://issues.apache.org/jira/browse/SPARK-19694
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: zhengruifeng
Priority: Trivial


{{LDAModel}} can not set the output column now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19670) Enable Bucketed Table Reading and Writing Testing Without Hive Support

2017-02-21 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19670.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Enable Bucketed Table Reading and Writing Testing Without Hive Support
> --
>
> Key: SPARK-19670
> URL: https://issues.apache.org/jira/browse/SPARK-19670
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Bucketed table reading and writing does not need Hive support. We can move 
> the test cases from `sql/hive` to `sql/core`. We can improve the test case 
> coverage. Bucket table reading and writing can be tested with and without 
> Hive support.
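
For context, the kind of operation these tests exercise; a minimal sketch with 
illustrative column and table names:

{code}
// Sketch only: a bucketed write followed by a read of the bucketed table.
spark.range(100)
  .selectExpr("id", "id % 10 AS bucket_col")
  .write.bucketBy(4, "bucket_col").sortBy("id")
  .saveAsTable("bucketed_demo")

spark.table("bucketed_demo").count()
{code}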



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19693) SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877391#comment-15877391
 ] 

Apache Spark commented on SPARK-19693:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/17020

> SET mapreduce.job.reduces automatically converted to 
> spark.sql.shuffle.partitions
> -
>
> Key: SPARK-19693
> URL: https://issues.apache.org/jira/browse/SPARK-19693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code}
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions200
> Time taken: 7.118 seconds, Fetched 1 row(s)
> spark-sql> set mapred.reduce.tasks=10;
> spark.sql.shuffle.partitions10
> Time taken: 5.968 seconds, Fetched 1 row(s)
> spark-sql> set mapreduce.job.reduces=20;
> set mapreduce.job.reduces   20
> Time taken: 0.233 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions10
> Time taken: 0.099 seconds, Fetched 1 row(s)
> {code}
> The {{SET mapreduce.job.reduces}} should be automatically converted to 
> {{spark.sql.shuffle.partitions}}, similar to {{set mapred.reduce.tasks}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19693) SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

2017-02-21 Thread Yuming Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-19693:

Comment: was deleted

(was: I'm working on.)

> SET mapreduce.job.reduces automatically converted to 
> spark.sql.shuffle.partitions
> -
>
> Key: SPARK-19693
> URL: https://issues.apache.org/jira/browse/SPARK-19693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code}
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions200
> Time taken: 7.118 seconds, Fetched 1 row(s)
> spark-sql> set mapred.reduce.tasks=10;
> spark.sql.shuffle.partitions10
> Time taken: 5.968 seconds, Fetched 1 row(s)
> spark-sql> set mapreduce.job.reduces=20;
> set mapreduce.job.reduces   20
> Time taken: 0.233 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions10
> Time taken: 0.099 seconds, Fetched 1 row(s)
> {code}
> The {{SET mapreduce.job.reduces}} should be automatically converted to 
> {{spark.sql.shuffle.partitions}}, similar to {{set mapred.reduce.tasks}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19693) SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19693:


Assignee: (was: Apache Spark)

> SET mapreduce.job.reduces automatically converted to 
> spark.sql.shuffle.partitions
> -
>
> Key: SPARK-19693
> URL: https://issues.apache.org/jira/browse/SPARK-19693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code}
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions200
> Time taken: 7.118 seconds, Fetched 1 row(s)
> spark-sql> set mapred.reduce.tasks=10;
> spark.sql.shuffle.partitions10
> Time taken: 5.968 seconds, Fetched 1 row(s)
> spark-sql> set mapreduce.job.reduces=20;
> set mapreduce.job.reduces   20
> Time taken: 0.233 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions10
> Time taken: 0.099 seconds, Fetched 1 row(s)
> {code}
> The {{SET mapreduce.job.reduces}} should be automatically converted to 
> {{spark.sql.shuffle.partitions}}, similar to {{set mapred.reduce.tasks}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19693) SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19693:


Assignee: Apache Spark

> SET mapreduce.job.reduces automatically converted to 
> spark.sql.shuffle.partitions
> -
>
> Key: SPARK-19693
> URL: https://issues.apache.org/jira/browse/SPARK-19693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>
> How to reproduce:
> {code}
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions200
> Time taken: 7.118 seconds, Fetched 1 row(s)
> spark-sql> set mapred.reduce.tasks=10;
> spark.sql.shuffle.partitions10
> Time taken: 5.968 seconds, Fetched 1 row(s)
> spark-sql> set mapreduce.job.reduces=20;
> set mapreduce.job.reduces   20
> Time taken: 0.233 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions10
> Time taken: 0.099 seconds, Fetched 1 row(s)
> {code}
> The {{SET mapreduce.job.reduces}} should be automatically converted to 
> {{spark.sql.shuffle.partitions}}, similar to {{set mapred.reduce.tasks}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19693) SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

2017-02-21 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877382#comment-15877382
 ] 

Yuming Wang commented on SPARK-19693:
-

I'm working on it.

> SET mapreduce.job.reduces automatically converted to 
> spark.sql.shuffle.partitions
> -
>
> Key: SPARK-19693
> URL: https://issues.apache.org/jira/browse/SPARK-19693
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Yuming Wang
>
> How to reproduce:
> {code}
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions200
> Time taken: 7.118 seconds, Fetched 1 row(s)
> spark-sql> set mapred.reduce.tasks=10;
> spark.sql.shuffle.partitions10
> Time taken: 5.968 seconds, Fetched 1 row(s)
> spark-sql> set mapreduce.job.reduces=20;
> set mapreduce.job.reduces   20
> Time taken: 0.233 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.shuffle.partitions;
> spark.sql.shuffle.partitions10
> Time taken: 0.099 seconds, Fetched 1 row(s)
> {code}
> The {{SET mapreduce.job.reduces}} should be automatically converted to 
> {{spark.sql.shuffle.partitions}}, similar to {{set mapred.reduce.tasks}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19693) SET mapreduce.job.reduces automatically converted to spark.sql.shuffle.partitions

2017-02-21 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-19693:
---

 Summary: SET mapreduce.job.reduces automatically converted to 
spark.sql.shuffle.partitions
 Key: SPARK-19693
 URL: https://issues.apache.org/jira/browse/SPARK-19693
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0, 2.2.0
Reporter: Yuming Wang


How to reproduce:
{code}
spark-sql> set spark.sql.shuffle.partitions;
spark.sql.shuffle.partitions200
Time taken: 7.118 seconds, Fetched 1 row(s)
spark-sql> set mapred.reduce.tasks=10;
spark.sql.shuffle.partitions10
Time taken: 5.968 seconds, Fetched 1 row(s)
spark-sql> set mapreduce.job.reduces=20;
set mapreduce.job.reduces   20
Time taken: 0.233 seconds, Fetched 1 row(s)
spark-sql> set spark.sql.shuffle.partitions;
spark.sql.shuffle.partitions10
Time taken: 0.099 seconds, Fetched 1 row(s)
{code}
The {{SET mapreduce.job.reduces}} should be automatically converted to 
{{spark.sql.shuffle.partitions}}, similar to {{set mapred.reduce.tasks}}.
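
A rough sketch of the aliasing being proposed (not Spark's actual SET handling); 
both legacy Hadoop keys would resolve to {{spark.sql.shuffle.partitions}}:

{code}
// Rough sketch of the proposed behaviour, not the real SetCommand implementation.
val legacyShuffleKeys = Set("mapred.reduce.tasks", "mapreduce.job.reduces")

def normalizeSetKey(key: String): String =
  if (legacyShuffleKeys.contains(key)) "spark.sql.shuffle.partitions" else key

// e.g. "SET mapreduce.job.reduces=20" would behave like:
spark.conf.set(normalizeSetKey("mapreduce.job.reduces"), "20")
{code}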



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877350#comment-15877350
 ] 

Kohki Nishio commented on SPARK-19675:
--

It turned out that this is a duplicate of 
https://issues.apache.org/jira/browse/SPARK-18646

I'll work on a PR.

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> The Spark executor loads classes from the SystemClassLoader, which contains 
> sbt-launch.jar with the Scala 2.10 binary; however, Spark itself is built on 
> Scala 2.11, thus it throws an InvalidClassException:
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loader (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar), but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19692) Comparison on BinaryType has incorrect results

2017-02-21 Thread Don Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Smith  updated SPARK-19692:
---
Summary: Comparison on BinaryType has incorrect results  (was: Comparison 
on BinaryType returns no results)

> Comparison on BinaryType has incorrect results
> --
>
> Key: SPARK-19692
> URL: https://issues.apache.org/jira/browse/SPARK-19692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Smith 
>
> I believe there is an issue with comparisons on binary fields:
> {code}
>   val sc = SparkSession.builder.appName("test").getOrCreate()
>   val schema = StructType(Seq(StructField("ip", BinaryType)))
>   val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
> InetAddress.getByName(s).getAddress)
>   val df = sc.createDataFrame(
> sc.sparkContext.parallelize(ips, 1).map { ip =>
>   Row(ip)
> }, schema
>   )
>   val query = df
> .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
> .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)
>   logger.info(query.explain(true))
>   val results = query.collect()
>   results.length mustEqual 1
> {code}
> returns no results.
> I believe the problem is that the comparison is coercing the bytes to signed 
> integers in the call to compareTo here in TypeUtils: 
> {code}
>   def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
> for (i <- 0 until x.length; if i < y.length) {
>   val res = x(i).compareTo(y(i))
>   if (res != 0) return res
> }
> x.length - y.length
>   }
> {code}
> with some hacky testing I was able to get the desired results with: {code} 
> val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code}
> thanks!
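
A minimal sketch of an unsigned lexicographic comparison along the lines the 
reporter suggests (illustrative only, not the committed fix):

{code}
// Unsigned lexicographic byte comparison; bytes are compared as 0..255 rather than -128..127.
def compareBinaryUnsigned(x: Array[Byte], y: Array[Byte]): Int = {
  var i = 0
  while (i < x.length && i < y.length) {
    val res = java.lang.Byte.toUnsignedInt(x(i)) - java.lang.Byte.toUnsignedInt(y(i))
    if (res != 0) return res
    i += 1
  }
  x.length - y.length
}
{code}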



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19692) Comparison on BinaryType returns no results

2017-02-21 Thread Don Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Don Smith  updated SPARK-19692:
---
Description: 
I believe there is an issue with comparisons on binary fields:
{code}
  val sc = SparkSession.builder.appName("test").getOrCreate()
  val schema = StructType(Seq(StructField("ip", BinaryType)))

  val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
InetAddress.getByName(s).getAddress)

  val df = sc.createDataFrame(
sc.sparkContext.parallelize(ips, 1).map { ip =>
  Row(ip)
}, schema
  )

  val query = df
.where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
.where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)

  logger.info(query.explain(true))
  val results = query.collect()
  results.length mustEqual 1
{code}

returns no results.
I believe the problem is that the comparison is coercing the bytes to signed 
integers in the call to compareTo here in TypeUtils: 
{code}
  def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
for (i <- 0 until x.length; if i < y.length) {
  val res = x(i).compareTo(y(i))
  if (res != 0) return res
}
x.length - y.length
  }
{code}

with some hacky testing I was able to get the desired results with: {code} val 
res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code}

thanks!

  was:
I believe there is an issue with comparisons on binary fields:
{code}
  val sc = SparkSession.builder.appName("test").getOrCreate()
  val schema = StructType(Seq(StructField("ip", BinaryType)))

  val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
InetAddress.getByName(s).getAddress)

  val df = sc.createDataFrame(
sc.sparkContext.parallelize(ips, 1).map { ip =>
  Row(ip)
}, schema
  )

  val query = df
.where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
.where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)

  logger.info(query.explain(true))
  val results = query.collect()
  results.length mustEqual 1
{code}

returns no results.
I believe the problem is that the comparison is coercing the bytes to signed 
integers in the call to compareTo here in TypeUtils: 
{code}
  def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
for (i <- 0 until x.length; if i < y.length) {
  val res = x(i).compareTo(y(i))
  if (res != 0) return res
}
x.length - y.length
  }
{code}

with some hacky testing I was able to get the desired results with: {{ val res 
= (x(i).toByte & 0xff) - (y(i).toByte & 0xff) }}

thanks!


> Comparison on BinaryType returns no results
> ---
>
> Key: SPARK-19692
> URL: https://issues.apache.org/jira/browse/SPARK-19692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Smith 
>
> I believe there is an issue with comparisons on binary fields:
> {code}
>   val sc = SparkSession.builder.appName("test").getOrCreate()
>   val schema = StructType(Seq(StructField("ip", BinaryType)))
>   val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
> InetAddress.getByName(s).getAddress)
>   val df = sc.createDataFrame(
> sc.sparkContext.parallelize(ips, 1).map { ip =>
>   Row(ip)
> }, schema
>   )
>   val query = df
> .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
> .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)
>   logger.info(query.explain(true))
>   val results = query.collect()
>   results.length mustEqual 1
> {code}
> returns no results.
> I believe the problem is that the comparison is coercing the bytes to signed 
> integers in the call to compareTo here in TypeUtils: 
> {code}
>   def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
> for (i <- 0 until x.length; if i < y.length) {
>   val res = x(i).compareTo(y(i))
>   if (res != 0) return res
> }
> x.length - y.length
>   }
> {code}
> with some hacky testing I was able to get the desired results with: {code} 
> val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code}
> thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19692) Comparison on BinaryType returns no results

2017-02-21 Thread Don Smith (JIRA)
Don Smith  created SPARK-19692:
--

 Summary: Comparison on BinaryType returns no results
 Key: SPARK-19692
 URL: https://issues.apache.org/jira/browse/SPARK-19692
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Don Smith 


I believe there is an issue with comparisons on binary fields:
{code}
  val sc = SparkSession.builder.appName("test").getOrCreate()
  val schema = StructType(Seq(StructField("ip", BinaryType)))

  val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => 
InetAddress.getByName(s).getAddress)

  val df = sc.createDataFrame(
sc.sparkContext.parallelize(ips, 1).map { ip =>
  Row(ip)
}, schema
  )

  val query = df
.where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress)
.where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress)

  logger.info(query.explain(true))
  val results = query.collect()
  results.length mustEqual 1
{code}

returns no results.
I believe the problem is that the comparison is coercing the bytes to signed 
integers in the call to compareTo here in TypeUtils: 
{code}
  def compareBinary(x: Array[Byte], y: Array[Byte]): Int = {
for (i <- 0 until x.length; if i < y.length) {
  val res = x(i).compareTo(y(i))
  if (res != 0) return res
}
x.length - y.length
  }
{code}

with some hacky testing I was able to get the desired results with: {{ val res 
= (x(i).toByte & 0xff) - (y(i).toByte & 0xff) }}

thanks!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19676) Flaky test: FsHistoryProviderSuite.SPARK-3697: ignore directories that cannot be read.

2017-02-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19676.

Resolution: Invalid

Reporter said he was running tests as root.

> Flaky test: FsHistoryProviderSuite.SPARK-3697: ignore directories that cannot 
> be read.
> --
>
> Key: SPARK-19676
> URL: https://issues.apache.org/jira/browse/SPARK-19676
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877290#comment-15877290
 ] 

Kohki Nishio edited comment on SPARK-19675 at 2/22/17 2:11 AM:
---

It is not in a separate process; I'm running it in local mode, so the executor 
lives within the same JVM. That's why it's picking up classes from sbt.

This ExecutorClassLoader's actual 'parentLoader' is the one that handles the REPL 
(remote class loader); however, it's backed by its local class loader:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L580

So I believe it is the right thing to load from there. For most remote 
situations, that class loader (the Spark class loader) is the system class loader 
itself, which is why it doesn't cause problems. However, my situation is local 
mode, and SBT doesn't use the class path; it just sets up a class loader tree.



was (Author: taroplus):
not in a separate process, I'm running it in local mode, so executor lives 
within the same JVM, that's why it's picking up classes from sbt.

This ExecutorClassLoader's actual 'parentLoader' is the one handles REPL 
(remote class loader), however it's backed by its local class loader
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L580

So I believe it is the right thing to load from there.


> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877290#comment-15877290
 ] 

Kohki Nishio commented on SPARK-19675:
--

It's not in a separate process; I'm running in local mode, so the executor lives
within the same JVM. That's why it's picking up classes from sbt.

This ExecutorClassLoader's actual 'parentLoader' is the one that handles the
REPL (the remote class loader); however, it's backed by its local class loader:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L580

So I believe loading from there is the right thing to do.


> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877273#comment-15877273
 ] 

Kohki Nishio edited comment on SPARK-19675 at 2/22/17 2:00 AM:
---

[~zsxwing] That only applies when findClass is used; it doesn't do it for
loadClass (Class.forName goes through loadClass, it seems).
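
A small sketch of the distinction being discussed (illustrative only, not
Spark's code; DemoLoader is a made-up name):

{code}
// loadClass does parent-first delegation and only calls findClass as a fallback,
// so a findClass override never sees classes the parent can already resolve.
class DemoLoader extends ClassLoader {
  override def findClass(name: String): Class[_] = {
    println(s"findClass asked for $name")   // not reached for classes the parent resolves
    super.findClass(name)
  }
}

// Class.forName(..., loader) enters through loadClass, not findClass.
Class.forName("java.lang.String", false, new DemoLoader())
{code}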


was (Author: taroplus):
[~zsxwing] That's only when findClass is used, it doesn't do it for loadClass

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877276#comment-15877276
 ] 

Shixiong Zhu commented on SPARK-19675:
--

Just to clarify one thing: Executors in your case will run in a separate 
process and know nothing about the magic class loader in the driver.

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877273#comment-15877273
 ] 

Kohki Nishio commented on SPARK-19675:
--

[~zsxwing] That's only when findClass is used, it doesn't do it for loadClass

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877262#comment-15877262
 ] 

Shixiong Zhu edited comment on SPARK-19675 at 2/22/17 1:59 AM:
---

[~taroplus] ExecutorClassLoader does try to load from its parent classloader: 
https://github.com/apache/spark/blob/master/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L77
The issue you reported is that it should load classes from the remote driver 
instead of the current parent classloader. Right?


was (Author: zsxwing):
[~taroplus] ExecutorClassLoader does try to load from its parent classloader: 
https://github.com/apache/spark/blob/master/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L77
The issue you reported is that it should load from the remote driver instead of 
the current parent classloader. Right?

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877256#comment-15877256
 ] 

Kohki Nishio edited comment on SPARK-19675 at 2/22/17 1:58 AM:
---

My application does the notebook thing; this is the same problem as Zeppelin not
being able to load a resource:

https://issues.apache.org/jira/browse/SPARK-11818

You can also see the same error message in a different notebook project here:

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case, especially in a notebook
setting where the REPL does its own class loader setup. This is a real issue;
it is actually blocking me and there's no workaround.

Why do you think this is going to break the YARN case? loadClass can try the
given class loader first and then fall back to the original; trying the given
class loader first is the right thing to do, e.g.:

{code}
try { parentLoader.loadClass(name) }
catch { case _: ClassNotFoundException => super.loadClass(name) }
{code}


was (Author: taroplus):
My application does notebook thing, this is about Zeppelin could not load 
resource

https://issues.apache.org/jira/browse/SPARK-11818

Also you can see the same error message in a different notebook project here

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case especially in a notebook 
situation where REPL does its own class loader thing. This is an issue, 
actually it is blocking me and there's no solution.

why you think this is going to break Yarn situation ? loadClass can try its 
class loader first then do the original, so given class loader first is a right 
thing to do.

{code}
try { parentLoader.loadClass(name) }
catch { super.loadClass(name) }
{code}

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877262#comment-15877262
 ] 

Shixiong Zhu commented on SPARK-19675:
--

[~taroplus] ExecutorClassLoader does try to load from its parent classloader: 
https://github.com/apache/spark/blob/master/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L77
The issue you reported is that it should load from the remote driver instead of 
the current parent classloader. Right?

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877256#comment-15877256
 ] 

Kohki Nishio edited comment on SPARK-19675 at 2/22/17 1:54 AM:
---

My application does notebook thing, this is about Zeppelin could not load 
resource

https://issues.apache.org/jira/browse/SPARK-11818

Also you can see the same error message in a different notebook project here

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case especially in a notebook 
situation where REPL does its own class loader thing. This is an issue, 
actually it is blocking me and there's no solution.

why you think this is going to break Yarn situation ? loadClass can try its 
class loader first then do the original, so given class loader first is a right 
thing to do.

{code}
try { parentLoader.loadClass(name) }
catch { super.loadClass(name) }
{code}


was (Author: taroplus):
My application does notebook thing, this is about Zeppelin could not load 
resource

https://issues.apache.org/jira/browse/SPARK-11818

Also you can see the same error message in a notebook project here

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case especially in a notebook 
situation where REPL does its own class loader thing. This is an issue, 
actually it is blocking me and there's no solution.

why you think this is going to break Yarn situation ? loadClass can try its 
class loader first then do the original, so given class loader first is a right 
thing to do.

{code}
try { parentLoader.loadClass(name) }
catch { super.loadClass(name) }
{code}

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877255#comment-15877255
 ] 

Shixiong Zhu edited comment on SPARK-19675 at 2/22/17 1:53 AM:
---

[~taroplus] Yeah, I should have checked `sbt run` myself. Looks like it does use
some class loader magic to support multiple Scala versions.

However, I don't see how to do this in Spark. In general, the recommended way to
run a Spark application is spark-submit rather than `sbt run`.

I checked local mode and it works. So the issue happens when launching executor
processes in local-cluster, client and cluster modes.

There are a couple of problems with supporting `sbt run`. E.g.:

- How do we know whether the driver is using `sbt run`?
- How do we launch a new executor process with the SBT ClassLoader automatically
on a remote node? The remote node may not have SBT installed.

This is really a low-priority feature. Feel free to reopen this ticket and
submit a PR if you have time to work on it.


was (Author: zsxwing):
[~taroplus] Yeah, I should have checked `sbt run` with myself. Looks like it 
does use some class loader magic to support multiple Scala versions.

However, I don't see how to do this in Spark. In general, the recommended way 
to run Spark application is using spark-submit rather than `sbt run`.

I checked the local mode and it works. So the issue happens when launching 
executor processes in local-cluster, client and cluster modes.

There are couples problems to support `sbt run`. E.g.,

- How to know if the driver is using `sbt run`?
- How to launch a new executor process with the SBT ClassLoader automatically 
in a remote node? The remote node may not have SBT installed.

This is really a low priority feature. Welcome to submit a PR if you have time 
to work on this.

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877256#comment-15877256
 ] 

Kohki Nishio commented on SPARK-19675:
--

My application does notebook thing, this is about Zeppelin could load resource

https://issues.apache.org/jira/browse/SPARK-11818

Also you can see the same error message in a notebook project here

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case especially in a notebook 
situation where REPL does its own class loader thing. This is an issue, 
actually it is blocking me and there's no solution.

why you think this is going to break Yarn situation ? loadClass can try its 
class loader first then do the original, so given class loader first is a right 
thing to do.

{code}
try { parentLoader.loadClass(name) }
catch { super.loadClass(name) }
{code}

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877256#comment-15877256
 ] 

Kohki Nishio edited comment on SPARK-19675 at 2/22/17 1:54 AM:
---

My application does notebook thing, this is about Zeppelin could not load 
resource

https://issues.apache.org/jira/browse/SPARK-11818

Also you can see the same error message in a notebook project here

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case especially in a notebook 
situation where REPL does its own class loader thing. This is an issue, 
actually it is blocking me and there's no solution.

why you think this is going to break Yarn situation ? loadClass can try its 
class loader first then do the original, so given class loader first is a right 
thing to do.

{code}
try { parentLoader.loadClass(name) }
catch { super.loadClass(name) }
{code}


was (Author: taroplus):
My application does notebook thing, this is about Zeppelin could load resource

https://issues.apache.org/jira/browse/SPARK-11818

Also you can see the same error message in a notebook project here

https://github.com/andypetrella/spark-notebook/issues/346

I would say this is not really a strange use case especially in a notebook 
situation where REPL does its own class loader thing. This is an issue, 
actually it is blocking me and there's no solution.

why you think this is going to break Yarn situation ? loadClass can try its 
class loader first then do the original, so given class loader first is a right 
thing to do.

{code}
try { parentLoader.loadClass(name) }
catch { super.loadClass(name) }
{code}

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877255#comment-15877255
 ] 

Shixiong Zhu commented on SPARK-19675:
--

[~taroplus] Yeah, I should have checked `sbt run` with myself. Looks like it 
does use some class loader magic to support multiple Scala versions.

However, I don't see how to do this in Spark. In general, the recommended way 
to run Spark application is using spark-submit rather than `sbt run`.

I checked the local mode and it works. So the issue happens when launching 
executor processes in local-cluster, client and cluster modes.

There are couples problems to support `sbt run`. E.g.,

- How to know if the driver is using `sbt run`?
- How to launch a new executor process with the SBT ClassLoader automatically 
in a remote node? The remote node may not have SBT installed.

This is really a low priority feature. Welcome to submit a PR if you have time 
to work on this.

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19675:
-
Issue Type: Improvement  (was: Bug)

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

2017-02-21 Thread yb (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877251#comment-15877251
 ] 

yb commented on SPARK-13767:


I use Spark 1.6.2 with Python 2.7.
I'm seeing the error that Venkata showed as well; if anyone has any thoughts on
why that would occur, I'd really appreciate it.
The error is as follows:
Traceback (most recent call last):
  File "E:/scrapy_workspace/testddd/dds/SparkTest.py", line 10, in 
conf=SparkConf().setAppName("pySparkDemo").setMaster("local")
  File "D:\Python27\lib\site-packages\pyspark\conf.py", line 104, in __init__
SparkContext._ensure_initialized()
  File "D:\Python27\lib\site-packages\pyspark\context.py", line 245, in 
_ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File "D:\Python27\lib\site-packages\pyspark\java_gateway.py", line 116, in 
launch_gateway
java_import(gateway.jvm, "org.apache.spark.SparkConf")
  File "D:\Python27\lib\site-packages\py4j\java_gateway.py", line 79, in 
java_import
answer = gateway_client.send_command(command)
  File "D:\Python27\lib\site-packages\py4j\java_gateway.py", line 624, in 
send_command
connection = self._get_connection()
  File "D:\Python27\lib\site-packages\py4j\java_gateway.py", line 579, in 
_get_connection
connection = self._create_connection()
  File "D:\Python27\lib\site-packages\py4j\java_gateway.py", line 585, in 
_create_connection
connection.start()
  File "D:\Python27\lib\site-packages\py4j\java_gateway.py", line 697, in start
raise Py4JNetworkError(msg, e)
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
the Java server

> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> 
>
> Key: SPARK-13767
> URL: https://issues.apache.org/jira/browse/SPARK-13767
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Poonam Agrawal
>
> I am trying to create spark context object with the following commands on 
> pyspark:
> from pyspark import SparkContext, SparkConf
> conf = 
> SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
>  'cassandra-machine-ip').set('spark.storage.memoryFraction', 
> '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
> 500).set('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
> 'FAIR').set('spark.mesos.coarse', 'true')
> sc = SparkContext(conf=conf)
> but I am getting the following error:
> Traceback (most recent call last):
> File "", line 1, in 
> File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in 
> __init__
>   self._jconf = _jvm.SparkConf(loadDefaults)
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 766, in __getattr__
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 362, in send_command
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 318, in _get_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 325, in _create_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 432, in start
> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> I am getting the same error executing the command : conf = 
> SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877239#comment-15877239
 ] 

Sean Owen commented on SPARK-19675:
---

No, this isn't even a supported way to run an executor. None of this
contemplates what it would change or break in actual deployments like on YARN.
As I've indicated twice, this is not a problem for Spark; at least none is
argued here.

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877235#comment-15877235
 ] 

Kohki Nishio commented on SPARK-19675:
--

I don't think this is clearly an invalid bug report; can we keep it open until
we figure it out?

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877234#comment-15877234
 ] 

Kohki Nishio commented on SPARK-19675:
--

Overriding loadClass fixes my scenario:

{code}
import java.net._
import org.apache.spark._
import org.apache.spark.repl._

// Loader that holds the desired Scala library, with no parent.
val desiredLoader = new URLClassLoader(
  Array(new URL("file:/tmp/scala-library-2.11.0.jar")),
  null)

// ExecutorClassLoader whose loadClass always asks the given parent loader
// instead of delegating to the system class loader.
val executorLoader = new ExecutorClassLoader(
  new SparkConf(),
  null,
  "",
  desiredLoader,
  false) {

  override def loadClass(name: String): Class[_] = {
    parentLoader.loadClass(name)
  }
}

Class.forName("scala.Option", false, executorLoader).getClassLoader()
{code}

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException

2017-02-21 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-19691:
--

 Summary: Calculating percentile of decimal column fails with 
ClassCastException
 Key: SPARK-19691
 URL: https://issues.apache.org/jira/browse/SPARK-19691
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Josh Rosen


Running

{code}
spark.range(10).selectExpr("cast (id as decimal) as x").selectExpr("percentile(x, 0.5)").collect()
{code}

results in a ClassCastException:

{code}
 java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
cast to java.lang.Number
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151)
at 
org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:78)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109)
at 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
{code}
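
As a rough illustration of the failure mode suggested by the stack trace (an
assumption, not a verified reading of Percentile's internals): the update path
casts the column value to java.lang.Number, and Spark's Decimal is not a
java.lang.Number, so the cast fails at runtime while an explicit conversion
works.

{code}
// Illustrative sketch only, e.g. in spark-shell.
import org.apache.spark.sql.types.Decimal

val v: Any = Decimal(5)
// v.asInstanceOf[java.lang.Number]               // throws ClassCastException, as in the trace above
val asDouble = v.asInstanceOf[Decimal].toDouble   // 5.0 -- explicit conversion succeeds
{code}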



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Kohki Nishio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877219#comment-15877219
 ] 

Kohki Nishio commented on SPARK-19675:
--

First of all, sbt (0.13) is built with Scala 2.10 and it can handle 2.11
applications.

When launching an application from "sbt run", sbt sets up the necessary class
loader tree within the same JVM, without putting the system class loader in
that tree. This is why it doesn't cause any problem inside the application most
of the time.

In Spark, however, ExecutorClassLoader's parent class is ClassLoader, whose
default constructor grabs the system class loader and sets it as the parent;
that is the problem here. That's why that class takes a different path to load
a class.

Since the application uses the class loader set up by SBT, it serializes
objects using classes loaded from there, but deserialization in the worker
thread tries to load the class from the system class loader.

I believe ExecutorClassLoader should always try to load classes from the given
class loader; otherwise there's a chance of hitting java.io.InvalidClassException.
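
A minimal sketch of the delegation problem described above (simplified and
hypothetical, not Spark's actual code; the jar path and NaiveExecutorLoader are
placeholders): a loader built via ClassLoader's no-arg constructor delegates to
the system class loader rather than to the isolated loader tree that sbt sets up.

{code}
import java.net.{URL, URLClassLoader}

// Isolated loader tree like the one sbt builds; the explicit null parent means
// the system class path is never consulted through this loader.
val sbtStyleLoader = new URLClassLoader(
  Array(new URL("file:/tmp/scala-library-2.11.8.jar")), null)

// A loader created through ClassLoader's no-arg constructor gets the system
// class loader as its parent, so plain parent-first delegation resolves
// scala.* from the system class path (e.g. sbt-launch.jar's Scala 2.10),
// never from sbtStyleLoader's 2.11 jars.
class NaiveExecutorLoader extends ClassLoader {
  // loadClass is intentionally not overridden.
}

val cls = new NaiveExecutorLoader().loadClass("scala.Option")
println(cls.getClassLoader)   // the system/app class loader (if it has scala-library); sbtStyleLoader is never asked
{code}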

> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loader (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19690) Join a streaming DataFrame with a batch DataFrame may not work

2017-02-21 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19690:


 Summary: Join a streaming DataFrame with a batch DataFrame may not 
work
 Key: SPARK-19690
 URL: https://issues.apache.org/jira/browse/SPARK-19690
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.3, 2.1.1
Reporter: Shixiong Zhu


When joining a streaming DataFrame with a batch DataFrame, if the batch 
DataFrame has an aggregation, it will be converted to a streaming physical 
aggregation. Then the query will crash.
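
A minimal sketch of the query shape described above (the socket source, column 
names and host/port are illustrative, not taken from the report):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamBatchJoinRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
    import spark.implicits._

    // Streaming side: lines from a socket, used as the join key.
    val streamDF = spark.readStream
      .format("socket").option("host", "localhost").option("port", "9999").load()
      .withColumnRenamed("value", "key")

    // Batch side with an aggregation; per the report, this aggregation ends up
    // planned as a streaming aggregation once joined with the stream, and the
    // query crashes.
    val batchDF = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "v")
      .groupBy("key").agg(sum("v").as("total"))

    val joined = streamDF.join(batchDF, Seq("key"))

    joined.writeStream.format("console").outputMode("append").start().awaitTermination()
  }
}
{code}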



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19689) Job Details page doesn't show 'Tasks: Succeeded/Total' progress bar text properly

2017-02-21 Thread Devaraj K (JIRA)
Devaraj K created SPARK-19689:
-

 Summary: Job Details page doesn't show 'Tasks: Succeeded/Total' 
progress bar text properly
 Key: SPARK-19689
 URL: https://issues.apache.org/jira/browse/SPARK-19689
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.1.0
Reporter: Devaraj K
Priority: Minor
 Attachments: Tasks Progress bar - Job Details Page.png

In Failed Stages table, 'Tasks: Succeeded/Total' value is not displayed properly 
when there is a Failure Reason with some multi-line text.

Please find the attached screenshot for more details.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19689) Job Details page doesn't show 'Tasks: Succeeded/Total' progress bar text properly

2017-02-21 Thread Devaraj K (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Devaraj K updated SPARK-19689:
--
Attachment: Tasks Progress bar - Job Details Page.png

> Job Details page doesn't show 'Tasks: Succeeded/Total' progress bar text 
> properly
> -
>
> Key: SPARK-19689
> URL: https://issues.apache.org/jira/browse/SPARK-19689
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Devaraj K
>Priority: Minor
> Attachments: Tasks Progress bar - Job Details Page.png
>
>
> In Failed Stages table, 'Tasks: Succeeded/Total' value is not displayed properly 
> when there is a Failure Reason with some multi-line text.
> Please find the attached screenshot for more details.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877128#comment-15877128
 ] 

Timothy Hunter commented on SPARK-19636:


Looking more closely at the code, it makes sense to start with a replacement of 
MultivariateStatisticalSummary, which is the basis of PearsonCorrelation and 
the final step of the Spearman correlation. Also, looking at these algorithms, 
they are not going to be written as UDAFs (unlike the original design), so the 
interface will need to take a {{Dataset[Vector]}} instead of a column.
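
A rough sketch of what such an interface could look like; the object and method 
names are placeholders, and it simply bridges to the existing spark.mllib 
implementation rather than proposing the final API:

{code}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Matrix, Vectors => OldVectors}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.sql.Dataset

object CorrelationSketch {
  // Takes a Dataset of spark.ml vectors and delegates to spark.mllib by
  // converting each vector to the old linalg type.
  def corr(dataset: Dataset[MLVector], method: String = "pearson"): Matrix = {
    val oldVectors = dataset.rdd.map(v => OldVectors.fromML(v))
    Statistics.corr(oldVectors, method)
  }
}
{code}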

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19652) REST API does not perform user auth for individual apps

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877073#comment-15877073
 ] 

Apache Spark commented on SPARK-19652:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/17019

> REST API does not perform user auth for individual apps
> ---
>
> Key: SPARK-19652
> URL: https://issues.apache.org/jira/browse/SPARK-19652
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Marcelo Vanzin
>
> (This goes back further than 2.0.0, btw.)
> The REST API currently only performs authorization at the root of the UI; 
> this works for live UIs, but not for the history server, where the root 
> allows everybody to read data. That means that currently any user can see any 
> application in the SHS through the REST API, when auth is enabled.
> Instead, the REST API should behave like the regular UI and perform 
> authentication at the app level too.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19681) save and load pipeline and then use it yield java.lang.RuntimeException

2017-02-21 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877033#comment-15877033
 ] 

Boris Clémençon  commented on SPARK-19681:
--

EDIT:
The problem is in Compare.assertDataFrameEquals, not in Spark. I am closing the 
ticket.

> save and load pipeline and then use it yield java.lang.RuntimeException
> ---
>
> Key: SPARK-19681
> URL: https://issues.apache.org/jira/browse/SPARK-19681
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Boris Clémençon 
>  Labels: spark-ml
>
> Here is the unit test that fails:
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}
> import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, 
> ParamGridBuilder}
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.scalatest.{BeforeAndAfter, FlatSpec, Matchers}
> import scala.util.Random
> /**
>   * Created by borisclemencon on 21/02/2017.
>   */
> class PipelineTest extends FlatSpec with Matchers with BeforeAndAfter {
>   val featuresCol = "features"
>   val responseCol = "response"
>   val weightCol = "weight"
>   val features = Array("X1", "X2")
>   val lambdas = Array(0.01)
>   val alpha = 0.2
>   val maxIter = 50
>   val nfolds = 5
>   var spark: SparkSession = _
>   before {
> val sparkConf: SparkConf = new SparkConf().
>   set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
>   set("spark.ui.enabled", "false"). // faster and remove 'spark test 
> java.net.BindException: Address already in use' warnings!
>   set("spark.driver.host", "127.0.0.1")
> spark = SparkSession.
>   builder().
>   config(sparkConf).
>   appName("BlendWeightTransformerTest").
>   master("local[*]").
>   getOrCreate()
>   }
>   def makeDataset(n: Int = 100): DataFrame = {
> val sc = spark
> import sc.implicits._
> val n = 1000
> val data =
>   for (i <- 1 to n) yield {
> val pn = if (Random.nextDouble() < 0.1) "a" else "b"
> val x1: Double = Random.nextGaussian() * 5
> val x2: Double = Random.nextGaussian() * 2
> val response: Int = if (Random.nextBoolean()) 1 else 0
> (pn, x1, x2, response)
>   }
> data.toDF(packageNameCol, "X1", "X2", responseCol)
>   }
>   "load()" should "produce the same pipeline and result before and after 
> save()" in {
> val lr = new LogisticRegression().
>   setFitIntercept(true).
>   setMaxIter(maxIter).
>   setElasticNetParam(alpha).
>   setStandardization(true).
>   setFamily("binomial").
>   setFeaturesCol(featuresCol).
>   setLabelCol(responseCol)
> val assembler = new 
> VectorAssembler().setInputCols(features).setOutputCol(featuresCol)
> val pipeline = new Pipeline().setStages(Array(assembler, lr))
> val evaluator = new BinaryClassificationEvaluator().
>   setLabelCol(responseCol).
>   setMetricName("areaUnderROC")
> val paramGrid = new ParamGridBuilder().
>   addGrid(lr.regParam, lambdas).
>   build()
> // Train with simple grid cross validation
> val cv = new CrossValidator().
>   setEstimator(pipeline).
>   setEvaluator(evaluator).
>   setEstimatorParamMaps(paramGrid).
>   setNumFolds(nfolds) // Use 3+ in practice
> val df = makeDataset(100).cache
> val cvModel = cv.fit(df)
> val answer = cvModel.transform(df)
> answer.show(truncate = false)
> val path = "./PipelineTestcvModel"
> cvModel.write.overwrite().save(path)
> val cvModelLoaded = CrossValidatorModel.load(path)
> val output = cvModelLoaded.transform(df)
> output.show(truncate = false)
> Compare.assertDataFrameEquals(answer, output)
>   }
> }
> yield exception
> should produce the same blent pipeline and result before and after save() *** 
> FAILED ***
> [info]   java.lang.RuntimeException: no default for type 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
> [info]   at 
> org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:121)
> [info]   at 
> org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:114)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   

[jira] [Closed] (SPARK-19681) save and load pipeline and then use it yield java.lang.RuntimeException

2017-02-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-19681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Boris Clémençon  closed SPARK-19681.

Resolution: Fixed

> save and load pipeline and then use it yield java.lang.RuntimeException
> ---
>
> Key: SPARK-19681
> URL: https://issues.apache.org/jira/browse/SPARK-19681
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Boris Clémençon 
>  Labels: spark-ml
>
> Here is the unit test that fails:
> import org.apache.spark.SparkConf
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}
> import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, 
> ParamGridBuilder}
> import org.apache.spark.sql.{DataFrame, SparkSession}
> import org.scalatest.{BeforeAndAfter, FlatSpec, Matchers}
> import scala.util.Random
> /**
>   * Created by borisclemencon on 21/02/2017.
>   */
> class PipelineTest extends FlatSpec with Matchers with BeforeAndAfter {
>   val featuresCol = "features"
>   val responseCol = "response"
>   val weightCol = "weight"
>   val features = Array("X1", "X2")
>   val lambdas = Array(0.01)
>   val alpha = 0.2
>   val maxIter = 50
>   val nfolds = 5
>   var spark: SparkSession = _
>   before {
> val sparkConf: SparkConf = new SparkConf().
>   set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
>   set("spark.ui.enabled", "false"). // faster and remove 'spark test 
> java.net.BindException: Address already in use' warnings!
>   set("spark.driver.host", "127.0.0.1")
> spark = SparkSession.
>   builder().
>   config(sparkConf).
>   appName("BlendWeightTransformerTest").
>   master("local[*]").
>   getOrCreate()
>   }
>   def makeDataset(n: Int = 100): DataFrame = {
> val sc = spark
> import sc.implicits._
> val n = 1000
> val data =
>   for (i <- 1 to n) yield {
> val pn = if (Random.nextDouble() < 0.1) "a" else "b"
> val x1: Double = Random.nextGaussian() * 5
> val x2: Double = Random.nextGaussian() * 2
> val response: Int = if (Random.nextBoolean()) 1 else 0
> (pn, x1, x2, response)
>   }
> data.toDF(packageNameCol, "X1", "X2", responseCol)
>   }
>   "load()" should "produce the same pipeline and result before and after 
> save()" in {
> val lr = new LogisticRegression().
>   setFitIntercept(true).
>   setMaxIter(maxIter).
>   setElasticNetParam(alpha).
>   setStandardization(true).
>   setFamily("binomial").
>   setFeaturesCol(featuresCol).
>   setLabelCol(responseCol)
> val assembler = new 
> VectorAssembler().setInputCols(features).setOutputCol(featuresCol)
> val pipeline = new Pipeline().setStages(Array(assembler, lr))
> val evaluator = new BinaryClassificationEvaluator().
>   setLabelCol(responseCol).
>   setMetricName("areaUnderROC")
> val paramGrid = new ParamGridBuilder().
>   addGrid(lr.regParam, lambdas).
>   build()
> // Train with simple grid cross validation
> val cv = new CrossValidator().
>   setEstimator(pipeline).
>   setEvaluator(evaluator).
>   setEstimatorParamMaps(paramGrid).
>   setNumFolds(nfolds) // Use 3+ in practice
> val df = makeDataset(100).cache
> val cvModel = cv.fit(df)
> val answer = cvModel.transform(df)
> answer.show(truncate = false)
> val path = "./PipelineTestcvModel"
> cvModel.write.overwrite().save(path)
> val cvModelLoaded = CrossValidatorModel.load(path)
> val output = cvModelLoaded.transform(df)
> output.show(truncate = false)
> Compare.assertDataFrameEquals(answer, output)
>   }
> }
> yield exception
> should produce the same blent pipeline and result before and after save() *** 
> FAILED ***
> [info]   java.lang.RuntimeException: no default for type 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
> [info]   at 
> org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:121)
> [info]   at 
> org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:114)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> [info]   at 

[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-21 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876979#comment-15876979
 ] 

Miao Wang commented on SPARK-19635:
---

https://github.com/apache/spark/pull/13440
[~timhunter] This one is related. The author has asked for a review. I mentioned 
you in the PR too.

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-21 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876977#comment-15876977
 ] 

Miao Wang commented on SPARK-19635:
---

I think there is a related PR that has been open for quite a while. Let me find it.

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19687) Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, kindly please help us with any examples.

2017-02-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19687.
---
Resolution: Invalid

Questions should go to u...@spark.apache.org

> Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, 
> kindly please help us with any examples.
> -
>
> Key: SPARK-19687
> URL: https://issues.apache.org/jira/browse/SPARK-19687
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Praveen Tallapudi
>
> Dear Team,
> I am little new to Scala development and trying to find the solution for the 
> below. Please forgive me if this is not the correct place to post this 
> question.
> I am trying to insert data from a data frame into postgres table.
>  
> Dataframe Schema:
> root
> |-- ID: string (nullable = true)
> |-- evtInfo: struct (nullable = true)
> ||-- @date: string (nullable = true)
> ||-- @time: string (nullable = true)
> ||-- @timeID: string (nullable = true)
> ||-- TranCode: string (nullable = true)
> ||-- custName: string (nullable = true)
> ||-- evtInfo: array (nullable = true)
> |||-- element: string (containsNull = true)
> ||-- Type: string (nullable = true)
> ||-- opID: string (nullable = true)
> ||-- tracNbr: string (nullable = true)
>  
>  
> DataBase Table Schema:
> CREATE TABLE public.test
> (
>id bigint NOT NULL,   
>evtInfo jsonb NOT NULL,
>evt_val bigint NOT NULL
> )
>  
> When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, 
> "public.test", dbPropForDFtoSave) to save the data, I am seeing the below 
> error.
>  
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC 
> type for 
> struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array,evtType:string,operID:string,trackingNbr:string>
>  
> Can you please suggest the best approach to save the data frame into the 
> posgres JSONB table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19688) Spark on Yarn Credentials File set to different application directory

2017-02-21 Thread Devaraj Jonnadula (JIRA)
Devaraj Jonnadula created SPARK-19688:
-

 Summary: Spark on Yarn Credentials File set to different 
application directory
 Key: SPARK-19688
 URL: https://issues.apache.org/jira/browse/SPARK-19688
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.6.3
Reporter: Devaraj Jonnadula
Priority: Minor


spark.yarn.credentials.file property is set to a different application Id instead 
of the actual Application Id.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19687) Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, kindly please help us with any examples.

2017-02-21 Thread Praveen Tallapudi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Tallapudi updated SPARK-19687:
--
Summary: Does SPARK supports for Postgres JSONB data type to store JSON 
data, if yes, kindly please help us with any examples.  (was: Does SPARK 
supports for Postgres JSONB to store JSON data, if yes, kindly please help us 
with any examples.)

> Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, 
> kindly please help us with any examples.
> -
>
> Key: SPARK-19687
> URL: https://issues.apache.org/jira/browse/SPARK-19687
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Praveen Tallapudi
>
> Dear Team,
> I am little new to Scala development and trying to find the solution for the 
> below. Please forgive me if this is not the correct place to post this 
> question.
> I am trying to insert data from a data frame into postgres table.
>  
> Dataframe Schema:
> root
> |-- ID: string (nullable = true)
> |-- evtInfo: struct (nullable = true)
> ||-- @date: string (nullable = true)
> ||-- @time: string (nullable = true)
> ||-- @timeID: string (nullable = true)
> ||-- TranCode: string (nullable = true)
> ||-- custName: string (nullable = true)
> ||-- evtInfo: array (nullable = true)
> |||-- element: string (containsNull = true)
> ||-- Type: string (nullable = true)
> ||-- opID: string (nullable = true)
> ||-- tracNbr: string (nullable = true)
>  
>  
> DataBase Table Schema:
> CREATE TABLE public.test
> (
>id bigint NOT NULL,   
>evtInfo jsonb NOT NULL,
>evt_val bigint NOT NULL
> )
>  
> When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, 
> "public.test", dbPropForDFtoSave) to save the data, I am seeing the below 
> error.
>  
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC 
> type for 
> struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array,evtType:string,operID:string,trackingNbr:string>
>  
> Can you please suggest the best approach to save the data frame into the 
> posgres JSONB table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19687) Does SPARK supports for Postgres JSONB to store JSON data, if yes, kindly please help us with any examples.

2017-02-21 Thread Praveen Tallapudi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Tallapudi updated SPARK-19687:
--
Summary: Does SPARK supports for Postgres JSONB to store JSON data, if yes, 
kindly please help us with any examples.  (was: SPARK supports for Postgres 
JSONB to store JSON data, if yes, kindly please help us with any examples.)

> Does SPARK supports for Postgres JSONB to store JSON data, if yes, kindly 
> please help us with any examples.
> ---
>
> Key: SPARK-19687
> URL: https://issues.apache.org/jira/browse/SPARK-19687
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Praveen Tallapudi
>
> Dear Team,
> I am little new to Scala development and trying to find the solution for the 
> below. Please forgive me if this is not the correct place to post this 
> question.
> I am trying to insert data from a data frame into postgres table.
>  
> Dataframe Schema:
> root
> |-- ID: string (nullable = true)
> |-- evtInfo: struct (nullable = true)
> ||-- @date: string (nullable = true)
> ||-- @time: string (nullable = true)
> ||-- @timeID: string (nullable = true)
> ||-- TranCode: string (nullable = true)
> ||-- custName: string (nullable = true)
> ||-- evtInfo: array (nullable = true)
> |||-- element: string (containsNull = true)
> ||-- Type: string (nullable = true)
> ||-- opID: string (nullable = true)
> ||-- tracNbr: string (nullable = true)
>  
>  
> DataBase Table Schema:
> CREATE TABLE public.test
> (
>id bigint NOT NULL,   
>evtInfo jsonb NOT NULL,
>evt_val bigint NOT NULL
> )
>  
> When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, 
> "public.test", dbPropForDFtoSave) to save the data, I am seeing the below 
> error.
>  
> Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC 
> type for 
> struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array,evtType:string,operID:string,trackingNbr:string>
>  
> Can you please suggest the best approach to save the data frame into the 
> posgres JSONB table.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19687) SPARK supports for Postgres JSONB to store JSON data, if yes, kindly please help us with any examples.

2017-02-21 Thread Praveen Tallapudi (JIRA)
Praveen Tallapudi created SPARK-19687:
-

 Summary: SPARK supports for Postgres JSONB to store JSON data, if 
yes, kindly please help us with any examples.
 Key: SPARK-19687
 URL: https://issues.apache.org/jira/browse/SPARK-19687
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Praveen Tallapudi


Dear Team,
I am a little new to Scala development and am trying to find a solution for the 
issue below. Please forgive me if this is not the correct place to post this question.

I am trying to insert data from a data frame into postgres table.
 
Dataframe Schema:
root
|-- ID: string (nullable = true)
|-- evtInfo: struct (nullable = true)
||-- @date: string (nullable = true)
||-- @time: string (nullable = true)
||-- @timeID: string (nullable = true)
||-- TranCode: string (nullable = true)
||-- custName: string (nullable = true)
||-- evtInfo: array (nullable = true)
|||-- element: string (containsNull = true)
||-- Type: string (nullable = true)
||-- opID: string (nullable = true)
||-- tracNbr: string (nullable = true)
 
 
DataBase Table Schema:
CREATE TABLE public.test
(
   id bigint NOT NULL,   
   evtInfo jsonb NOT NULL,
   evt_val bigint NOT NULL
)
 
When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, 
"public.test", dbPropForDFtoSave) to save the data, I am seeing the below error.
 
Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC 
type for 
struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array,evtType:string,operID:string,trackingNbr:string>
 
Can you please suggest the best approach to save the data frame into the 
Postgres JSONB table?
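
One possible workaround (a sketch only, not verified against this schema): 
Spark's JDBC writer has no mapping for struct columns, so serialize the struct 
to a JSON string with {{to_json}} before writing, and let the Postgres driver 
pass the string through to the jsonb column (for example via the 
{{stringtype=unspecified}} JDBC URL parameter; behaviour may vary by driver 
version).

{code}
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, to_json}

// Serialize the struct column so the JDBC writer sees a plain string.
val jsonReady = dataFrame_toSave.withColumn("evtInfo", to_json(col("evtInfo")))

// Assumes dbUrl has no other query parameters yet.
val dbUrlWithHint = dbUrl + "?stringtype=unspecified"

jsonReady.write.mode(SaveMode.Append).jdbc(dbUrlWithHint, "public.test", dbPropForDFtoSave)
{code}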



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19639) Add spark.svmLinear example and update vignettes

2017-02-21 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876936#comment-15876936
 ] 

Miao Wang commented on SPARK-19639:
---

This JIRA should be closed, as the PR has been merged. Thanks! cc [~felixcheung]

> Add spark.svmLinear example and update vignettes
> 
>
> Key: SPARK-19639
> URL: https://issues.apache.org/jira/browse/SPARK-19639
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Miao Wang
>
> We recently added the spark.svmLinear API for SparkR. We need to add an example 
> and update the vignettes.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876929#comment-15876929
 ] 

Timothy Hunter commented on SPARK-19636:


Unless someone has started to work on this task, I will take it.

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19686) Spark poc projects

2017-02-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19686.
---
Resolution: Invalid

> Spark poc projects
> --
>
> Key: SPARK-19686
> URL: https://issues.apache.org/jira/browse/SPARK-19686
> Project: Spark
>  Issue Type: Epic
>  Components: Examples
>Affects Versions: 2.1.0
>Reporter: Srinivasa Kalyanachakravarthy Vasagiri
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19686) Spark poc projects

2017-02-21 Thread Srinivasa Kalyanachakravarthy Vasagiri (JIRA)
Srinivasa Kalyanachakravarthy Vasagiri created SPARK-19686:
--

 Summary: Spark poc projects
 Key: SPARK-19686
 URL: https://issues.apache.org/jira/browse/SPARK-19686
 Project: Spark
  Issue Type: Epic
  Components: Examples
Affects Versions: 2.1.0
Reporter: Srinivasa Kalyanachakravarthy Vasagiri






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876854#comment-15876854
 ] 

Shixiong Zhu edited comment on SPARK-19675 at 2/21/17 10:17 PM:


[~taroplus] If I understand correctly, SBT launches your application in a new 
JVM process *without any SBT code*. But your Spark application needs to run 
Spark code that is prebuilt with a specific Scala version *before creating 
ExecutorClassLoader*. That's totally different.

In your example, `Class.forName("scala.Option", false, executorLoader)` doesn't 
call `ExecutorClassLoader.findClass` because "scala.Option" is already loaded.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import java.net._
import org.apache.spark._
import org.apache.spark.repl._

val desiredLoader = new URLClassLoader(
   Array(new URL("file:/tmp/scala-library-2.11.0.jar")),
   null)

val executorLoader = new ExecutorClassLoader(
  new SparkConf(),
  null,
  "",
  desiredLoader,
  false) {
  override def findClass(name: String): Class[_] = {
  println("finding class: " + name)
  super.findClass(name)
  }
}

Class.forName("scala.Option", false, executorLoader).getClassLoader()

// Exiting paste mode, now interpreting.

import java.net._
import org.apache.spark._
import org.apache.spark.repl._
desiredLoader: java.net.URLClassLoader = java.net.URLClassLoader@37f41a81
executorLoader: org.apache.spark.repl.ExecutorClassLoader = $anon$1@1c3d9e28
res0: ClassLoader = sun.misc.Launcher$AppClassLoader@4d76f3f8

scala> 
{code}

In the above example, you can see `findClass` is not called.



was (Author: zsxwing):
[~taroplus] If I understand correctly, SBT launches your application in a new 
JVM process *without any SBT codes*. But your Spark applications requires to 
run Spark codes which is prebuilt with a specific Scala versions *before 
creating ExecutorClassLoader*. That's totally different.

In your example, `Class.forName("scala.Option", false, executorLoader)` doesn't 
call `ExecutorClassLoader.findClass` because "scala.Option" is already loaded.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import java.net._
import org.apache.spark._
import org.apache.spark.repl._

val desiredLoader = new URLClassLoader(
   Array(new URL("file:/tmp/scala-library-2.11.0.jar")),
   null)

val executorLoader = new ExecutorClassLoader(
  new SparkConf(),
  null,
  "",
  desiredLoader,
  false) {
  override def findClass(name: String): Class[_] = {
  println("finding class: " + name)
  super.findClass(name)
  }
}

Class.forName("scala.Option", false, executorLoader).getClassLoader()

// Exiting paste mode, now interpreting.

import java.net._
import org.apache.spark._
import org.apache.spark.repl._
desiredLoader: java.net.URLClassLoader = java.net.URLClassLoader@37f41a81
executorLoader: org.apache.spark.repl.ExecutorClassLoader = $anon$1@1c3d9e28
res0: ClassLoader = sun.misc.Launcher$AppClassLoader@4d76f3f8

scala> 
{code}

In the above example, you can see `findClass` is not called.


> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loder (parentLoader) actually contains 
> the correct 

[jira] [Commented] (SPARK-19675) ExecutorClassLoader loads classes from SystemClassLoader

2017-02-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876854#comment-15876854
 ] 

Shixiong Zhu commented on SPARK-19675:
--

[~taroplus] If I understand correctly, SBT launches your application in a new 
JVM process *without any SBT codes*. But your Spark applications requires to 
run Spark codes which is prebuilt with a specific Scala versions *before 
creating ExecutorClassLoader*. That's totally different.

In your example, `Class.forName("scala.Option", false, executorLoader)` doesn't 
call `ExecutorClassLoader.findClass` because "scala.Option" is already loaded.

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

import java.net._
import org.apache.spark._
import org.apache.spark.repl._

val desiredLoader = new URLClassLoader(
   Array(new URL("file:/tmp/scala-library-2.11.0.jar")),
   null)

val executorLoader = new ExecutorClassLoader(
  new SparkConf(),
  null,
  "",
  desiredLoader,
  false) {
  override def findClass(name: String): Class[_] = {
  println("finding class: " + name)
  super.findClass(name)
  }
}

Class.forName("scala.Option", false, executorLoader).getClassLoader()

// Exiting paste mode, now interpreting.

import java.net._
import org.apache.spark._
import org.apache.spark.repl._
desiredLoader: java.net.URLClassLoader = java.net.URLClassLoader@37f41a81
executorLoader: org.apache.spark.repl.ExecutorClassLoader = $anon$1@1c3d9e28
res0: ClassLoader = sun.misc.Launcher$AppClassLoader@4d76f3f8

scala> 
{code}

In the above example, you can see `findClass` is not called.


> ExecutorClassLoader loads classes from SystemClassLoader
> 
>
> Key: SPARK-19675
> URL: https://issues.apache.org/jira/browse/SPARK-19675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: sbt / Play Framework
>Reporter: Kohki Nishio
>Priority: Minor
>
> Spark Executor loads classes from SystemClassLoader which contains 
> sbt-launch.jar and it contains Scala2.10 binary, however Spark itself is 
> built on Scala2.11, thus it's throwing InvalidClassException
> java.io.InvalidClassException: scala.Option; local class incompatible: stream 
> classdesc serialVersionUID = -114498752079829388, local class 
> serialVersionUID = 5081326844987135632
>   at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
>   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
> ExecutorClassLoader's desired class loader (parentLoader) actually contains 
> the correct path (scala-library-2.11.8.jar) but it is not being used.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19684) Move info about running specific tests to developer website

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876846#comment-15876846
 ] 

Apache Spark commented on SPARK-19684:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/17018

> Move info about running specific tests to developer website
> ---
>
> Key: SPARK-19684
> URL: https://issues.apache.org/jira/browse/SPARK-19684
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> This JIRA accompanies this change to the website: 
> https://github.com/apache/spark-website/pull/33.
> Running individual tests is not something that changes with new versions of 
> the project, and is primarily used by developers (not users) so should be 
> moved to the developer-tools page of the main website (with a link from the 
> building-spark page on the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19684) Move info about running specific tests to developer website

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19684:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Move info about running specific tests to developer website
> ---
>
> Key: SPARK-19684
> URL: https://issues.apache.org/jira/browse/SPARK-19684
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
> This JIRA accompanies this change to the website: 
> https://github.com/apache/spark-website/pull/33.
> Running individual tests is not something that changes with new versions of 
> the project, and is primarily used by developers (not users) so should be 
> moved to the developer-tools page of the main website (with a link from the 
> building-spark page on the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19684) Move info about running specific tests to developer website

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19684:


Assignee: Kay Ousterhout  (was: Apache Spark)

> Move info about running specific tests to developer website
> ---
>
> Key: SPARK-19684
> URL: https://issues.apache.org/jira/browse/SPARK-19684
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.1.1
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> This JIRA accompanies this change to the website: 
> https://github.com/apache/spark-website/pull/33.
> Running individual tests is not something that changes with new versions of 
> the project, and is primarily used by developers (not users) so should be 
> moved to the developer-tools page of the main website (with a link from the 
> building-spark page on the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19685) PipedRDD tasks should not hang on interruption / errors

2017-02-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876828#comment-15876828
 ] 

Josh Rosen commented on SPARK-19685:


By the way, one simple fix here might be to use the "soft-kill then hard-kill 
after timeout" helper function in {{Utils}} to terminate the child process.

> PipedRDD tasks should not hang on interruption / errors
> ---
>
> Key: SPARK-19685
> URL: https://issues.apache.org/jira/browse/SPARK-19685
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>
> While looking at WARN and ERROR-level logs from Spark executors, I spotted a 
> problem where PipedRDD tasks may continue running after being cancelled or 
> after failing. Specifically, I saw many cancelled tasks hanging in the 
> following stacks:
> {code}
> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
> java.io.FilterOutputStream.close(FilterOutputStream.java:158)
> java.lang.UNIXProcess.destroy(UNIXProcess.java:445)
> java.lang.UNIXProcess.destroy(UNIXProcess.java:478)
> org.apache.spark.rdd.PipedRDD$$anon$1.propagateChildException(PipedRDD.scala:203)
> org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:183)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
> scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
> scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
> scala.collection.AbstractIterator.fold(Iterator.scala:1336)
> org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
> org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> org.apache.spark.scheduler.Task.run(Task.scala:99)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> and 
> {code}
> java.io.FileInputStream.readBytes(Native Method)
> java.io.FileInputStream.read(FileInputStream.java:255)
> java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
> java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
> sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
> sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
> java.io.InputStreamReader.read(InputStreamReader.java:184)
> java.io.BufferedReader.fill(BufferedReader.java:161)
> java.io.BufferedReader.readLine(BufferedReader.java:324)
> java.io.BufferedReader.readLine(BufferedReader.java:389)
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
> org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:172)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
> scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
> scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
> scala.collection.AbstractIterator.fold(Iterator.scala:1336)
> org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
> org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> org.apache.spark.scheduler.Task.run(Task.scala:99)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}
> I do not have a minimal reproduction of this issue yet, but I suspect that we 
> can make one by having PipedRDD call a process which hangs indefinitely 
> without printing any output, then cancel the Spark job with 
> {{interruptOnCancel=true}}. If my hunch is right, we should witness the 
> PipedRDD tasks continuing to run either because the call to destroy the child 
> process 

[jira] [Created] (SPARK-19685) PipedRDD tasks should not hang on interruption / errors

2017-02-21 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-19685:
--

 Summary: PipedRDD tasks should not hang on interruption / errors
 Key: SPARK-19685
 URL: https://issues.apache.org/jira/browse/SPARK-19685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 2.0.0, 1.6.0
Reporter: Josh Rosen


While looking at WARN and ERROR-level logs from Spark executors, I spotted a 
problem where PipedRDD tasks may continue running after being cancelled or 
after failing. Specifically, I saw many cancelled tasks hanging in the 
following stacks:

{code}
java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
java.io.FilterOutputStream.close(FilterOutputStream.java:158)
java.lang.UNIXProcess.destroy(UNIXProcess.java:445)
java.lang.UNIXProcess.destroy(UNIXProcess.java:478)
org.apache.spark.rdd.PipedRDD$$anon$1.propagateChildException(PipedRDD.scala:203)
org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:183)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
scala.collection.AbstractIterator.fold(Iterator.scala:1336)
org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
org.apache.spark.scheduler.Task.run(Task.scala:99)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

and 

{code}
java.io.FileInputStream.readBytes(Native Method)
java.io.FileInputStream.read(FileInputStream.java:255)
java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
java.io.BufferedInputStream.read(BufferedInputStream.java:345)
sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
java.io.InputStreamReader.read(InputStreamReader.java:184)
java.io.BufferedReader.fill(BufferedReader.java:161)
java.io.BufferedReader.readLine(BufferedReader.java:324)
java.io.BufferedReader.readLine(BufferedReader.java:389)
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:172)
scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
scala.collection.Iterator$class.foreach(Iterator.scala:893)
scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
scala.collection.AbstractIterator.fold(Iterator.scala:1336)
org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1086)
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
org.apache.spark.scheduler.Task.run(Task.scala:99)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

I do not have a minimal reproduction of this issue yet, but I suspect that we 
can make one by having PipedRDD call a process which hangs indefinitely without 
printing any output, then cancel the Spark job with {{interruptOnCancel=true}}. 
If my hunch is right, we should witness the PipedRDD tasks continuing to run 
either because the call to destroy the child process is hanging or because we 
don't check whether the task has been interrupted. 
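
A sketch of that kind of reproduction (the piped command, timings and job-group 
name are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object PipedRddHangRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("piped-hang"))

    // Pipe through a command that never prints anything and never exits.
    val piped = sc.parallelize(1 to 4, 4).pipe(Seq("sh", "-c", "sleep 100000"))

    val jobGroup = "piped-hang-group"
    val runner = new Thread(new Runnable {
      override def run(): Unit = {
        // setJobGroup is thread-local, so set it in the thread that submits the job.
        sc.setJobGroup(jobGroup, "hanging piped job", interruptOnCancel = true)
        try piped.count() catch { case _: Throwable => () } // cancellation throws
      }
    })
    runner.start()

    Thread.sleep(5000)
    sc.cancelJobGroup(jobGroup)
    // Per the hunch above, the PipedRDD tasks may keep running after this point.
  }
}
{code}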



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19683) Support for libsvm-based learning-to-rank format

2017-02-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876812#comment-15876812
 ] 

Sean Owen commented on SPARK-19683:
---

This is trivial for an application to implement. Unless it is a common format 
extension and one the core library can use, is there a particular reason for it 
to be in Spark rather than in an application or an external library?
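
As an illustration of how an application could handle it today (a sketch 
assuming the layout shown in the ticket: label, a {{qid:}} token, index:value 
pairs, and an optional {{# docid=...}} comment; {{QueryLabeledPoint}} is an 
illustrative name, not Spark API):

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

case class QueryLabeledPoint(label: Double, qid: Long, features: Vector, docId: Option[String])

def parseLtrLine(line: String, numFeatures: Int): QueryLabeledPoint = {
  // Split off the trailing "# ..." comment, if any.
  val (body, comment) = line.split("#", 2) match {
    case Array(b, c) => (b.trim, Some(c.trim))
    case Array(b)    => (b.trim, None)
  }
  val tokens = body.split("\\s+")
  val label = tokens(0).toDouble
  val qid = tokens(1).stripPrefix("qid:").toLong
  val (indices, values) = tokens.drop(2).map { t =>
    val Array(i, v) = t.split(":", 2)
    (i.toInt - 1, v.toDouble) // libsvm feature indices are 1-based
  }.unzip
  QueryLabeledPoint(label, qid, Vectors.sparse(numFeatures, indices, values),
    comment.map(_.stripPrefix("docid=")))
}

// e.g. parseLtrLine("0 qid:1 1:2.9 2:9.4 # docid=clueweb09-00-01492", numFeatures = 10)
{code}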

> Support for libsvm-based learning-to-rank format
> 
>
> Key: SPARK-19683
> URL: https://issues.apache.org/jira/browse/SPARK-19683
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Craig Macdonald
>Priority: Minor
>
> I would like to use Spark for reading/processing Learning to Rank files. The 
> standard format is an extension of libsvm:
> {code}
> 0 qid:1 1:2.9 2:9.4 # docid=clueweb09-00-01492
> {code}
> Under the mllib API, a LabeledPoint would need an extension called 
> QueryLabeledPoint.
> I would also like to investigate using it through the DataFrame API, extending 
> the libsvm source; however, many of the classes/methods used there are private 
> (e.g. LibSVMOptions, Datatype.sameType(), VectorUDT). So would an extension 
> to handle the LTR format be better inside Spark or outside?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH

2017-02-21 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15874956#comment-15874956
 ] 

Yun Ni edited comment on SPARK-18454 at 2/21/17 9:39 PM:
-

[~josephkb] [~sethah] [~mlnick] [~karlhigley] I opened a gDoc for discussions. 
Feel free to make comments and add any new sections:

https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit


was (Author: yunn):
[~josephkb] [~sethah] [~mlnick] I opened a gDoc for discussions. Feel free to 
make comments and add any new sections:

https://docs.google.com/document/d/1opWy2ohXaDWjamV8iC0NKbaZL9JsjZCix2Av5SS3D9g/edit

> Changes to improve Nearest Neighbor Search for LSH
> --
>
> Key: SPARK-18454
> URL: https://issues.apache.org/jira/browse/SPARK-18454
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yun Ni
>
> We all agree to do the following improvement to Multi-Probe NN Search:
> (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing 
> full sort on the whole dataset
> Currently we are still discussing the following:
> (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}}
> (2) What are the issues and how we should change the current Nearest Neighbor 
> implementation
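
A sketch of the approxQuantile idea from point (1) above (the column name and 
error bound are illustrative):

{code}
import org.apache.spark.sql.DataFrame

// Estimate the k-th smallest hashDistance without sorting the whole dataset.
def hashDistanceThreshold(distances: DataFrame, k: Int, relativeError: Double = 0.001): Double = {
  val n = distances.count()
  val quantile = math.min(1.0, k.toDouble / n)
  distances.stat.approxQuantile("hashDistance", Array(quantile), relativeError).head
}
{code}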



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19617) Fix the race condition when starting and stopping a query quickly

2017-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19617:
-
Comment: was deleted

(was: User 'gf53520' has created a pull request for this issue:
https://github.com/apache/spark/pull/16980)

> Fix the race condition when starting and stopping a query quickly
> -
>
> Key: SPARK-19617
> URL: https://issues.apache.org/jira/browse/SPARK-19617
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> The streaming thread in StreamExecution uses the following ways to check if 
> it should exit:
> - Catch an InterruptedException.
> - `StreamExecution.state` is TERMINATED.
> When starting and stopping a query quickly, the above two checks may both 
> fail:
> - Hit [HADOOP-14084|https://issues.apache.org/jira/browse/HADOOP-14084], which 
> swallows the InterruptedException.
> - StreamExecution.stop is called before `state` becomes `ACTIVE`. Then 
> [runBatches|https://github.com/apache/spark/blob/dcc2d540a53f0bd04baead43fdee1c170ef2b9f3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L252]
>  changes the state from `TERMINATED` to `ACTIVE`.
> If both of the above happen, the query will hang forever.
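
A purely illustrative sketch of the race and of one way to guard the state 
transition with compareAndSet; the state names and structure are assumptions 
for discussion, not the actual StreamExecution code:

{code}
// Sketch only (not the real StreamExecution): stop() may run before the streaming thread
// starts its batches, so the INITIALIZING -> ACTIVE transition must not overwrite TERMINATED.
import java.util.concurrent.atomic.AtomicReference

object QueryStateSketch {
  sealed trait State
  case object INITIALIZING extends State
  case object ACTIVE extends State
  case object TERMINATED extends State

  private val state = new AtomicReference[State](INITIALIZING)

  def stop(): Unit = state.set(TERMINATED)

  def runBatches(): Unit = {
    // compareAndSet fails if stop() already set TERMINATED, so the thread exits
    // instead of flipping the state back to ACTIVE and hanging forever.
    if (!state.compareAndSet(INITIALIZING, ACTIVE)) return
    while (state.get() == ACTIVE) {
      // process one batch here; stop() ends the loop by setting TERMINATED
    }
  }
}
{code}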



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19677) HDFSBackedStateStoreProvider fails to overwrite existing file

2017-02-21 Thread Heji Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876747#comment-15876747
 ] 

Heji Kim commented on SPARK-19677:
--

Thank you for reporting this issue!  I just wanted to add that we get the same 
HDFS error when we restart our structured streaming drivers, but also on the 
very first run (without any restart) when we run a more complex driver using 
withWatermark/agg/groupBy/orderBy:

java.lang.IllegalStateException: Error committing version 1 into 
HDFSStateStore[id = (op=0, part=15), dir = 
/user/spark/checkpoints/StructuredStreamingSignalAggregation/ss_StructuredStreamingSignalAggregation/state/0/15]
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.commit(HDFSBackedStateStoreProvider.scala:162)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:138)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(StatefulAggregate.scala:123)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323).

> HDFSBackedStateStoreProvider fails to overwrite existing file
> -
>
> Key: SPARK-19677
> URL: https://issues.apache.org/jira/browse/SPARK-19677
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Roberto Agostino Vitillo
>Priority: Critical
>
> I got the exception below after restarting a crashed Structured Streaming 
> application. This seems to be due to the fact that 
> {{/tmp/checkpoint/state/0/0/214451.delta}} already exists in HDFS.
> {code}
> 17/02/20 14:14:26 ERROR StreamExecution: Query [id = 
> 5023231c-2433-4013-a8b9-d54bb5751445, runId = 
> 4168cf31-7d0b-4435-9b58-28919abd937b] terminated with error
> org.apache.spark.SparkException: Job aborted.
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:78)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:503)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:503)
> at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:262)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:46)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:502)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply$mcV$sp(StreamExecution.scala:255)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
> at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$1.apply(StreamExecution.scala:244)
> at 
> 

[jira] [Updated] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19680:
-
Component/s: (was: Structured Streaming)
 DStreams

> Offsets out of range with no configured reset policy for partitions
> ---
>
> Key: SPARK-19680
> URL: https://issues.apache.org/jira/browse/SPARK-19680
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Schakmann Rene
>
> I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
> read all the messages in Kafka, so I set
>    "auto.offset.reset" -> "earliest"
> Nevertheless, when I start the job on our Spark cluster it is not working; I 
> get:
> Error:
> {code:title=error.log|borderStyle=solid}
>   Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
> most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
> executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {SearchEvents-2=161803385}
> {code}
> This seems wrong, because I did set the auto.offset.reset property.
> Setup:
> Kafka Parameter:
> {code:title=Config.Scala|borderStyle=solid}
>   def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
> Object] = {
> Map(
>   "bootstrap.servers" -> 
> properties.getProperty("kafka.bootstrap.servers"),
>   "group.id" -> properties.getProperty("kafka.consumer.group"),
>   "auto.offset.reset" -> "earliest",
>   "spark.streaming.kafka.consumer.cache.enabled" -> "false",
>   "enable.auto.commit" -> "false",
>   "key.deserializer" -> classOf[StringDeserializer],
>   "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
>   }
> {code}
> Job:
> {code:title=Job.Scala|borderStyle=solid}
>   def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
> Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
> Broadcast[KafkaSink[TopList]]): Unit = {
> getFilteredStream(stream.map(_.value()), windowDuration, 
> slideDuration).foreachRDD(rdd => {
>   val topList = new TopList
>   topList.setCreated(new Date())
>   topList.setTopListEntryList(rdd.take(TopListLength).toList)
>   CurrentLogger.info("TopList length: " + 
> topList.getTopListEntryList.size().toString)
>   kafkaSink.value.send(SendToTopicName, topList)
>   CurrentLogger.info("Last Run: " + System.currentTimeMillis())
> })
>   }
>   def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
> slideDuration: Int): DStream[TopListEntry] = {
> val Mapper = MapperObject.readerFor[SearchEventDTO]
> result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
>   .filter(s => s != null && s.getSearchRequest != null && 
> s.getSearchRequest.getSearchParameters != null && s.getVertical == 
> Vertical.BAP && 
> s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
>   .map(row => {
> val name = 
> row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
> (name, new TopListEntry(name, 1, row.getResultCount))
>   })
>   .reduceByKeyAndWindow(
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + 
> b.getMeanSearchHits),
> (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
> a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - 
> b.getMeanSearchHits),
> Minutes(windowDuration),
> Seconds(slideDuration))
>   .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L)
>   .map(row => (row._2.getSearchCount, row._2))
>   .transform(rdd => rdd.sortByKey(ascending = false))
>   .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, 
> row._2.getMeanSearchHits / row._2.getSearchCount))
>   }
>   def main(properties: Properties): Unit = {
> val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName)
> val kafkaSink = 
> sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties)))
> val kafkaParams: Map[String, Object] = 
> SparkUtil.getDefaultKafkaReceiverParameter(properties)
> val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(30))
> ssc.checkpoint("/home/spark/checkpoints")
> val adEventStream =
>   KafkaUtils.createDirectStream[String, Array[Byte]](ssc, 
> PreferConsistent, Subscribe[String, Array[Byte]](Array(ReadFromTopicName), 
> kafkaParams))
> processSearchKeyWords(adEventStream, 
> 

[jira] [Created] (SPARK-19684) Move info about running specific tests to developer website

2017-02-21 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19684:
--

 Summary: Move info about running specific tests to developer 
website
 Key: SPARK-19684
 URL: https://issues.apache.org/jira/browse/SPARK-19684
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.1
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


This JIRA accompanies this change to the website: 
https://github.com/apache/spark-website/pull/33.

Running individual tests is not something that changes with new versions of the 
project, and it is primarily done by developers (not users), so the information 
should be moved to the developer-tools page of the main website (with a link 
from the building-spark page in the release-specific docs).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19683) Support for libsvm-based learning-to-rank format

2017-02-21 Thread Craig Macdonald (JIRA)
Craig Macdonald created SPARK-19683:
---

 Summary: Support for libsvm-based learning-to-rank format
 Key: SPARK-19683
 URL: https://issues.apache.org/jira/browse/SPARK-19683
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.1.0
Reporter: Craig Macdonald
Priority: Minor


I would like to use Spark for reading/processing Learning to Rank files. The 
standard format is an extension of libsvm:

{code}
0 qid:1 1:2.9 2:9.4 # docid=clueweb09-00-01492
{code}

Under the mllib API, a LabeledPoint would need an extension called 
QueryLabeledPoint.

I would also like to investigate its use through the DataFrame API by extending 
the libsvm source; however, many of the classes/methods used there are private 
(e.g. LibSVMOptions, Datatype.sameType(), VectorUDT). So would an extension to 
handle the LTR format be better inside Spark or outside?
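
For illustration, a rough sketch of parsing one line of the qid-extended format 
shown above; the QueryLabeledPoint case class and the parsing rules are 
assumptions for the sake of the example, not a proposed Spark API:

{code}
// Sketch only: parse one line of the qid-extended libsvm format shown above.
case class QueryLabeledPoint(label: Double, qid: Long, features: Map[Int, Double], docid: Option[String])

def parseLine(line: String): QueryLabeledPoint = {
  // Split off the optional "# ..." comment first
  val (body, comment) = line.split("#", 2) match {
    case Array(b, c) => (b.trim, Some(c.trim))
    case Array(b)    => (b.trim, None)
  }
  val tokens   = body.split("\\s+")
  val label    = tokens(0).toDouble
  val qid      = tokens(1).stripPrefix("qid:").toLong
  val features = tokens.drop(2).map { t =>
    val Array(index, value) = t.split(":", 2)
    index.toInt -> value.toDouble
  }.toMap
  QueryLabeledPoint(label, qid, features, comment.map(_.stripPrefix("docid=")))
}

// parseLine("0 qid:1 1:2.9 2:9.4 # docid=clueweb09-00-01492")
//   == QueryLabeledPoint(0.0, 1, Map(1 -> 2.9, 2 -> 9.4), Some("clueweb09-00-01492"))
{code}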



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19497) dropDuplicates with watermark

2017-02-21 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19497:
-
Labels: release_notes  (was: )

> dropDuplicates with watermark
> -
>
> Key: SPARK-19497
> URL: https://issues.apache.org/jira/browse/SPARK-19497
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Shixiong Zhu
>Priority: Critical
>  Labels: release_notes
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19682) Issue warning (or error) when subset method "[[" takes vector index

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19682:


Assignee: (was: Apache Spark)

> Issue warning (or error) when subset method "[[" takes vector index
> ---
>
> Key: SPARK-19682
> URL: https://issues.apache.org/jira/browse/SPARK-19682
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> The `[[` method is supposed to take a single index and return a column. This 
> is different from base R, which takes a vector index.  We should check for 
> this and issue a warning or error when a vector index is supplied (which is 
> very likely given the behavior in base R). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19682) Issue warning (or error) when subset method "[[" takes vector index

2017-02-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19682:


Assignee: Apache Spark

> Issue warning (or error) when subset method "[[" takes vector index
> ---
>
> Key: SPARK-19682
> URL: https://issues.apache.org/jira/browse/SPARK-19682
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> The `[[` method is supposed to take a single index and return a column. This 
> is different from base R, which takes a vector index.  We should check for 
> this and issue a warning or error when a vector index is supplied (which is 
> very likely given the behavior in base R). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19682) Issue warning (or error) when subset method "[[" takes vector index

2017-02-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876497#comment-15876497
 ] 

Apache Spark commented on SPARK-19682:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17017

> Issue warning (or error) when subset method "[[" takes vector index
> ---
>
> Key: SPARK-19682
> URL: https://issues.apache.org/jira/browse/SPARK-19682
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> The `[[` method is supposed to take a single index and return a column. This 
> is different from base R, which takes a vector index.  We should check for 
> this and issue a warning or error when a vector index is supplied (which is 
> very likely given the behavior in base R). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19682) Issue warning (or error) when subset method "[[" takes vector index

2017-02-21 Thread Wayne Zhang (JIRA)
Wayne Zhang created SPARK-19682:
---

 Summary: Issue warning (or error) when subset method "[[" takes 
vector index
 Key: SPARK-19682
 URL: https://issues.apache.org/jira/browse/SPARK-19682
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.1.0
Reporter: Wayne Zhang
Priority: Minor


The `[[` method is supposed to take a single index and return a column. This is 
different from base R, which takes a vector index.  We should check for this and 
issue a warning or error when a vector index is supplied (which is very likely 
given the behavior in base R). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-02-21 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876466#comment-15876466
 ] 

Imran Rashid commented on SPARK-19659:
--

[~jinxing6...@126.com]  Thanks for taking this on, I think this is a *really* 
important improvement for Spark -- but it's also changing some very core logic 
which I think needs to be done with a lot of caution.  Can you please post a 
design doc here for discussion?

While the heuristics you are proposing seem reasonable, I have a number of 
concerns:

* What about when there are > 2k partitions and the block size is unknown?  
Especially in the case of skew, this is a huge problem.  Perhaps we should 
first tackle that problem, to have better size estimations (with bounded 
error) in that case.
* I think it will need to be configured independently from maxBytesInFlight.
* Would it be possible to have the shuffle fetch memory usage tracked by the 
MemoryManager?  That would be another way to avoid OOM.  Note this is pretty 
tricky since right now that memory is controlled by Netty.
* What are the performance ramifications of these changes?  What tests have 
been done to understand the effects?

I still think that having the shuffle fetch streamed to disk is a good idea, 
but we should think carefully about the right way to control it, and some of 
these other ideas should perhaps be done first.  It's at least worth discussing 
before just doing the implementation.

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle read. A block is defined by (shuffleId, mapId, reduceId). Thus it can 
> be large in skew situations. If OOM happens during shuffle read, the job will 
> be killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating 
> more memory can resolve the OOM. However, the approach is not perfectly 
> suitable for a production environment, especially for a data warehouse.
> Using Spark SQL as the data engine in a warehouse, users hope to have a 
> unified parameter (e.g. memory) with less resource wasted (resource that is 
> allocated but not used).
> It's not always easy to predict skew situations; when they happen, it makes 
> sense to fetch remote blocks to disk for the shuffle read rather than kill 
> the job because of OOM. This approach was mentioned during the discussion in 
> SPARK-3019 by [~sandyr] and [~mridulm80].
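
For concreteness, a purely illustrative sketch of the spill-to-disk decision 
described above; the names and the threshold parameter are assumptions for 
discussion, not Spark's actual shuffle internals:

{code}
// Sketch only: route a shuffle block to disk once its estimated size exceeds a threshold,
// instead of always buffering it in memory.
sealed trait FetchDestination
case object FetchToMemory extends FetchDestination
case object FetchToDisk extends FetchDestination

def chooseDestination(estimatedBlockBytes: Long, maxInMemoryBlockBytes: Long): FetchDestination =
  if (estimatedBlockBytes > maxInMemoryBlockBytes) FetchToDisk else FetchToMemory
{code}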



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19681) save and load pipeline and then use it yields java.lang.RuntimeException

2017-02-21 Thread JIRA
Boris Clémençon  created SPARK-19681:


 Summary: save and load pipeline and then use it yields 
java.lang.RuntimeException
 Key: SPARK-19681
 URL: https://issues.apache.org/jira/browse/SPARK-19681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Boris Clémençon 


Here is the unit test that fails:


import org.apache.spark.SparkConf
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{SQLTransformer, VectorAssembler}
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, 
ParamGridBuilder}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{BeforeAndAfter, FlatSpec, Matchers}

import scala.util.Random


/**
  * Created by borisclemencon on 21/02/2017.
  */
class PipelineTest extends FlatSpec with Matchers with BeforeAndAfter {

  val featuresCol = "features"
  val responseCol = "response"
  val weightCol = "weight"
  // packageNameCol is referenced in makeDataset but was missing from the snippet; the name is assumed
  val packageNameCol = "packageName"
  val features = Array("X1", "X2")
  val lambdas = Array(0.01)

  val alpha = 0.2
  val maxIter = 50
  val nfolds = 5

  var spark: SparkSession = _

  before {
val sparkConf: SparkConf = new SparkConf().
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  set("spark.ui.enabled", "false"). // faster and remove 'spark test 
java.net.BindException: Address already in use' warnings!
  set("spark.driver.host", "127.0.0.1")

spark = SparkSession.
  builder().
  config(sparkConf).
  appName("BlendWeightTransformerTest").
  master("local[*]").
  getOrCreate()
  }


  def makeDataset(n: Int = 100): DataFrame = {
    val sc = spark
    import sc.implicits._
val data =
  for (i <- 1 to n) yield {
val pn = if (Random.nextDouble() < 0.1) "a" else "b"
val x1: Double = Random.nextGaussian() * 5
val x2: Double = Random.nextGaussian() * 2
val response: Int = if (Random.nextBoolean()) 1 else 0
(pn, x1, x2, response)
  }
data.toDF(packageNameCol, "X1", "X2", responseCol)
  }

  "load()" should "produce the same pipeline and result before and after 
save()" in {

val lr = new LogisticRegression().
  setFitIntercept(true).
  setMaxIter(maxIter).
  setElasticNetParam(alpha).
  setStandardization(true).
  setFamily("binomial").
  setFeaturesCol(featuresCol).
  setLabelCol(responseCol)

val assembler = new 
VectorAssembler().setInputCols(features).setOutputCol(featuresCol)
val pipeline = new Pipeline().setStages(Array(assembler, lr))
val evaluator = new BinaryClassificationEvaluator().
  setLabelCol(responseCol).
  setMetricName("areaUnderROC")
val paramGrid = new ParamGridBuilder().
  addGrid(lr.regParam, lambdas).
  build()

// Train with simple grid cross validation
val cv = new CrossValidator().
  setEstimator(pipeline).
  setEvaluator(evaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(nfolds) // Use 3+ in practice

val df = makeDataset(100).cache
val cvModel = cv.fit(df)

val answer = cvModel.transform(df)
answer.show(truncate = false)

val path = "./PipelineTestcvModel"
cvModel.write.overwrite().save(path)

val cvModelLoaded = CrossValidatorModel.load(path)
val output = cvModelLoaded.transform(df)
output.show(truncate = false)
Compare.assertDataFrameEquals(answer, output)
  }
}

which yields this exception:

should produce the same blent pipeline and result before and after save() *** 
FAILED ***
[info]   java.lang.RuntimeException: no default for type 
org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
[info]   at 
org.apache.spark.sql.catalyst.expressions.Literal$.default(literals.scala:179)
[info]   at 
org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:121)
[info]   at 
org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$$anonfun$4.apply(patterns.scala:114)
[info]   at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info]   at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
[info]   at scala.collection.immutable.List.foreach(List.scala:381)
[info]   at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
[info]   at scala.collection.immutable.List.flatMap(List.scala:344)
[info]   at 
org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys$.unapply(patterns.scala:114)
[info]   at 
org.apache.spark.sql.execution.SparkStrategies$JoinSelection$.apply(SparkStrategies.scala:158)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Resolved] (SPARK-19626) Configuration `spark.yarn.credentials.updateTime` takes no effect

2017-02-21 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19626.

   Resolution: Fixed
 Assignee: Kent Yao
Fix Version/s: 2.2.0
   2.1.1

> Configuration `spark.yarn.credentials.updateTime` takes no effect
> -
>
> Key: SPARK-19626
> URL: https://issues.apache.org/jira/browse/SPARK-19626
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 2.1.1, 2.2.0
>
>
> In [SPARK-14743|https://github.com/apache/spark/pull/14065], we introduced a 
> configurable credential manager for Spark running on YARN. Two configs, 
> *spark.yarn.credentials.renewalTime* and *spark.yarn.credentials.updateTime*, 
> were also added, one for the credential renewer and the other for the updater. 
> But currently we query *spark.yarn.credentials.renewalTime* by mistake during 
> credentials updating, where it should actually be 
> *spark.yarn.credentials.updateTime*. 
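
A minimal sketch of the corrected lookup described above; the default value 
shown is an assumption, not the one used by Spark's credential updater:

{code}
// Sketch only: the update path should read the update-time key, not the renewal-time key.
import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
val credentialsUpdateTime: Long =
  sparkConf.getTimeAsMs("spark.yarn.credentials.updateTime", "1h")
{code}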



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19337) Documentation and examples for LinearSVC

2017-02-21 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19337.
--
   Resolution: Fixed
 Assignee: yuhao yang
Fix Version/s: 2.2.0

> Documentation and examples for LinearSVC
> 
>
> Key: SPARK-19337
> URL: https://issues.apache.org/jira/browse/SPARK-19337
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.2.0
>
>
> User guide + example code for LinearSVC



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19680) Offsets out of range with no configured reset policy for partitions

2017-02-21 Thread Schakmann Rene (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Schakmann Rene updated SPARK-19680:
---
Description: 
I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
read all the messages in Kafka, so I set

   "auto.offset.reset" -> "earliest"

Nevertheless, when I start the job on our Spark cluster it is not working; I get:

Error:
{code:title=error.log|borderStyle=solid}
Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
Offsets out of range with no configured reset policy for partitions: 
{SearchEvents-2=161803385}
{code}
This seems wrong, because I did set the auto.offset.reset property.

Setup:

Kafka Parameter:


{code:title=Config.Scala|borderStyle=solid}
  def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, 
Object] = {
Map(
  "bootstrap.servers" -> properties.getProperty("kafka.bootstrap.servers"),
  "group.id" -> properties.getProperty("kafka.consumer.group"),
  "auto.offset.reset" -> "earliest",
  "spark.streaming.kafka.consumer.cache.enabled" -> "false",
  "enable.auto.commit" -> "false",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer")
  }
{code}
Job:

{code:title=Job.Scala|borderStyle=solid}
  def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, 
Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: 
Broadcast[KafkaSink[TopList]]): Unit = {
getFilteredStream(stream.map(_.value()), windowDuration, 
slideDuration).foreachRDD(rdd => {
  val topList = new TopList
  topList.setCreated(new Date())
  topList.setTopListEntryList(rdd.take(TopListLength).toList)
  CurrentLogger.info("TopList length: " + 
topList.getTopListEntryList.size().toString)
  kafkaSink.value.send(SendToTopicName, topList)
  CurrentLogger.info("Last Run: " + System.currentTimeMillis())
})

  }

  def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, 
slideDuration: Int): DStream[TopListEntry] = {

val Mapper = MapperObject.readerFor[SearchEventDTO]

result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s))
  .filter(s => s != null && s.getSearchRequest != null && 
s.getSearchRequest.getSearchParameters != null && s.getVertical == Vertical.BAP 
&& 
s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName))
  .map(row => {
val name = 
row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase()
(name, new TopListEntry(name, 1, row.getResultCount))
  })
  .reduceByKeyAndWindow(
(a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + b.getMeanSearchHits),
(a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, 
a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - b.getMeanSearchHits),
Minutes(windowDuration),
Seconds(slideDuration))
  .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L)
  .map(row => (row._2.getSearchCount, row._2))
  .transform(rdd => rdd.sortByKey(ascending = false))
  .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, 
row._2.getMeanSearchHits / row._2.getSearchCount))
  }

  def main(properties: Properties): Unit = {

val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName)
val kafkaSink = 
sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties)))
val kafkaParams: Map[String, Object] = 
SparkUtil.getDefaultKafkaReceiverParameter(properties)
val ssc = new StreamingContext(sparkSession.sparkContext, Seconds(30))

ssc.checkpoint("/home/spark/checkpoints")
val adEventStream =
  KafkaUtils.createDirectStream[String, Array[Byte]](ssc, PreferConsistent, 
Subscribe[String, Array[Byte]](Array(ReadFromTopicName), kafkaParams))

processSearchKeyWords(adEventStream, 
SparkUtil.getWindowDuration(properties), 
SparkUtil.getSlideDuration(properties), kafkaSink)

ssc.start()
ssc.awaitTermination()

  }
{code}



  was:
I'm using Spark Streaming with Kafka to actually create a toplist. I want to 
read all the messages in Kafka, so I set

   "auto.offset.reset" -> "earliest"

Nevertheless, when I start the job on our Spark cluster it is not working; I get:

Error:

Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, 
most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, 
executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
