[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451743#comment-16451743 ] yucai commented on SPARK-24076: --- 1. When shuffle.partition = 8192, tuples in the same partition satisfy the relation below: hash(tuple x) = hash(tuple y) + n * 8192. 2. In the next HashAggregate stage, tuples from the same partition need to be put into a 16K BytesToBytesMap (unsafeRowAggBuffer). Because the HashAggregate uses the same hash algorithm and seed as the shuffle, all tuples are actually hashed to only 2 distinct slots, which is why the hash conflict happens. > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
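[Editor's note] The collision arithmetic in the comment above can be illustrated with a small Python model. This is a sketch, not Spark's actual Murmur3-based hashing; only the partition count 8192 and the 16K map capacity are taken from the report, and `bucket` is a hypothetical stand-in for the map's slot computation:

```python
# Toy model: keys routed to one shuffle partition p satisfy
# hash(key) % 8192 == p. A 16K (2^14) hash map that reuses the same
# hash values can therefore place them in only two buckets:
# p and p + 8192.
NUM_PARTITIONS = 8192      # spark.sql.shuffle.partitions from the report
MAP_CAPACITY = 16384       # 16K BytesToBytesMap capacity from the report

def bucket(h: int) -> int:
    # slot index in a power-of-two-sized table (hypothetical stand-in)
    return h & (MAP_CAPACITY - 1)

p = 42  # an arbitrary partition id
# hash values of tuples that all landed in partition p
hashes = [p + n * NUM_PARTITIONS for n in range(1000)]
buckets = {bucket(h) for h in hashes}
print(sorted(buckets))  # [42, 8234] -- 1000 keys, only 2 buckets
```

With 1000 distinct keys from one partition, every insert lands in one of just two buckets, so probing degenerates into long linear scans, matching the reported slowdown.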
[jira] [Commented] (SPARK-20087) Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd listeners
[ https://issues.apache.org/jira/browse/SPARK-20087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451732#comment-16451732 ] Xianjin YE commented on SPARK-20087: cc [~jiangxb1987] [~irashid], I am going to send a new pr if you still think this is the desired behaviour. > Include accumulators / taskMetrics when sending TaskKilled to onTaskEnd > listeners > - > > Key: SPARK-20087 > URL: https://issues.apache.org/jira/browse/SPARK-20087 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Charles Lewis >Priority: Major > > When tasks end due to an ExceptionFailure, subscribers to onTaskEnd receive > accumulators / task metrics for that task, if they were still available. > These metrics are not currently sent when tasks are killed intentionally, > such as when a speculative retry finishes, and the original is killed (or > vice versa). Since we're killing these tasks ourselves, these metrics should > almost always exist, and we should treat them the same way as we treat > ExceptionFailures. > Sending these metrics with the TaskKilled end reason makes aggregation across > all tasks in an app more accurate. This data can inform decisions about how > to tune the speculation parameters in order to minimize duplicated work, and > in general, the total cost of an app should include both successful and > failed tasks, if that information exists. > PR: https://github.com/apache/spark/pull/17422
[jira] [Created] (SPARK-24082) [Spark SQL] Tables are not listing under DB
ABHISHEK KUMAR GUPTA created SPARK-24082: Summary: [Spark SQL] Tables are not listing under DB Key: SPARK-24082 URL: https://issues.apache.org/jira/browse/SPARK-24082 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Environment: OS: Suse11 Spark Version: 2.3 Reporter: ABHISHEK KUMAR GUPTA Steps: # Launch spark-sql --master yarn # use one; (the DB name is "one") # create table csvTable (time timestamp, name string, isright boolean, datetoday date, num binary, height double, score float, decimaler decimal(10,0), id tinyint, age int, license bigint, length smallint) using CSV options (path "/user/datatmo/customer1.csv"); # CREATE TABLE Parquettemp USING org.apache.spark.sql.parquet OPTIONS (path "/user/sparkhive/warehouse/one.db/test1.parquet/"); # show tables; The tables are listed as below: one csvtable false one parquettemp false But when listing the contents of "one.db" in the HDFS file system, the csvTable and Parquettemp tables do not appear. The command below was used: BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls /user/sparkhive/warehouse/one.db
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451728#comment-16451728 ] yucai commented on SPARK-24076: --- Root cause: a severe hash conflict in HashAggregate. !image-2018-04-25-14-29-39-958.png! > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24076: -- Attachment: image-2018-04-25-14-29-39-958.png > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: image-2018-04-25-14-29-39-958.png, p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Updated] (SPARK-24081) Spark SQL drops the table while writing into table in "overwrite" mode.
[ https://issues.apache.org/jira/browse/SPARK-24081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ashish updated SPARK-24081: --- Priority: Blocker (was: Major) > Spark SQL drops the table while writing into table in "overwrite" mode. > > > Key: SPARK-24081 > URL: https://issues.apache.org/jira/browse/SPARK-24081 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Ashish >Priority: Blocker > > I read data from a table, modify it, and when I write it back to the table in > overwrite mode, all the records are deleted. > Expectation: the table should be updated with the modified data.
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451727#comment-16451727 ] yucai commented on SPARK-24076: --- An example query (a closing parenthesis and subquery alias, missing in the original comment, have been restored): {code:sql} insert overwrite table target_xxx SELECT item_id, auct_end_dt FROM (select cast(item_id as double) as item_id, auct_end_dt from source_xxx GROUP BY item_id, auct_end_dt) t {code} > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 in some cases.
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451699#comment-16451699 ] Nadav Samet commented on SPARK-24074: - Also, this problem doesn't occur with Spark 2.2.1, but happens on Spark 2.3.0 - this also makes it appear as though it is related to the Maven resolution mechanism in Spark. > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > For some reason Spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later Spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451692#comment-16451692 ] Nadav Samet commented on SPARK-24074: - I think that breeze is fine: SBT is able to correctly resolve and download the dependencies of breeze. Also, the pom files for both projects seem valid (I only inspected them manually). I am not sure what's causing this, but somehow Spark ends up downloading the wrong artifact, while SBT doesn't. > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > For some reason Spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later Spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451689#comment-16451689 ] Xianjin YE commented on SPARK-19256: Hi [~tejasp] [~cloud_fan], are you still working on this? We also need this feature in our internal Spark stack; is anything still pending? cc [~XuanYuan] > Hive bucketing support > -- > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Priority: Minor > > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
[jira] [Updated] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'
[ https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chris_j updated SPARK-24009: Description: In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully. On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is not a permission problem either. 1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" writes to the local directory successfully. 2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" writes to HDFS successfully. 3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' row format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from default.dim_date" on YARN fails to write to the local directory: Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249) at org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123) at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272) ... 8 more Caused by: java.io.IOException: Mkdirs failed to create file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0 (exists=false, cwd=file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801) at org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261) at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
[jira] [Commented] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451679#comment-16451679 ] Hyukjin Kwon commented on SPARK-24078: -- Would you be able to test this in higher versions? > reduce with unionAll takes a long time > -- > > Key: SPARK-24078 > URL: https://issues.apache.org/jira/browse/SPARK-24078 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.6.3 >Reporter: zhangsongcheng >Priority: Major > > I try to sample the training sets for each category and then union all the > samples together. This is my code: > def balance4Single(dataSet: DataFrame): DataFrame = { > val samples = LabelConf.cardIDList.map { cardID => > val tmpDataSet = dataSet.filter(col("card_id") === cardID) > val sample = underSample(tmpDataSet, cardID) > sample > } > samples.reduce((x, y) => x.unionAll(y)) > } > def underSample(dataSet: DataFrame, cardID: String): DataFrame = { > val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1) > val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1) > positiveSample.unionAll(negativeSample).distinct() > } > But the code blocks in {{samples.reduce((x, y) => x.unionAll(y))}}; it runs > more and more slowly and eventually cannot make progress at all. This has > confused me for a long time. Who can help me? Thank you!
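[Editor's note] The quoted reduce builds a left-deep chain of unionAll plans: each step nests the entire previous plan, so plan depth (and analysis cost) grows with every union. A toy Python model can show the shape of the problem; this is an editor's sketch, not Spark code, and `union` is only a hypothetical stand-in for `DataFrame.unionAll`:

```python
# Compare plan depth for a left-deep reduce vs. a balanced (tree) reduce
# over 64 inputs. In Spark, a shallower plan is much cheaper to analyze.
from functools import reduce

def union(a, b):
    # stand-in for DataFrame.unionAll: a binary plan node
    return ("union", a, b)

def depth(plan):
    # depth of the nested plan tree; leaves are plain strings
    if isinstance(plan, tuple):
        return 1 + max(depth(plan[1]), depth(plan[2]))
    return 1

leaves = [f"df{i}" for i in range(64)]

# left-deep chain, as in samples.reduce((x, y) => x.unionAll(y))
left_deep = reduce(union, leaves)

def tree_reduce(xs):
    # balanced alternative: pairwise merge halves recursively
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return union(tree_reduce(xs[:mid]), tree_reduce(xs[mid:]))

balanced = tree_reduce(leaves)
print(depth(left_deep), depth(balanced))  # 64 vs 7
```

A balanced, tree-shaped reduce keeps the plan logarithmically shallow, which is one common mitigation for exactly this kind of slowdown.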
[jira] [Resolved] (SPARK-24077) Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
[ https://issues.apache.org/jira/browse/SPARK-24077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24077. -- Resolution: Invalid Fix Version/s: (was: 3.0.0) Target Version/s: (was: 2.3.0) Questions should go to mailing lists. I am resolving this. > Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > > > Key: SPARK-24077 > URL: https://issues.apache.org/jira/browse/SPARK-24077 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Benedict Jin >Priority: Major > > Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? > > scala> > org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE > TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan'") > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) > == SQL == > CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as > 'org.apache.spark.sql.hive.udf.YuZhouWan' > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) > at > org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) > at > org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) > ... 48 elided
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451677#comment-16451677 ] Hyukjin Kwon commented on SPARK-24074: -- I haven't looked into this yet, but doesn't this sound more likely to be specific to breeze? > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > For some reason Spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2. Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later Spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451666#comment-16451666 ] Spark User commented on SPARK-5594: --- In my case, this issue happened when the Spark context did not close successfully. When the context closes abruptly, the files in the spark-local and spark-worker directories are left uncleaned, and the next time any job runs, the broadcast exception occurs. I worked around it by redirecting the spark-worker and spark-local outputs to specific folders and cleaning them up whenever the Spark context does not close successfully. > SparkException: Failed to get broadcast (TorrentBroadcast) > -- > > Key: SPARK-5594 > URL: https://issues.apache.org/jira/browse/SPARK-5594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: John Sandiford >Priority: Critical > > I am uncertain whether this is a bug; however, I am getting the error below > when running on a cluster (it works locally), and I have no idea what is causing > it or where to look for more information. > Any help is appreciated. Others appear to experience the same issue, but I > have not found any solutions online. > Please note that this only happens with certain code and is repeatable; all > my other Spark jobs work fine.
> {noformat} > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) > ... 11 more > {noformat} > Driver stacktrace: > {noformat} > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.
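[Editor's note] The cleanup workaround described in the comment above can be sketched as a small shell routine. This is a hedged sketch: the directory paths below are illustrative assumptions, while SPARK_LOCAL_DIRS and SPARK_WORKER_DIR are Spark's standard environment variables for relocating these outputs.

```shell
# Point Spark's scratch space and worker output at dedicated, known
# folders (paths are illustrative assumptions, not from the report).
export SPARK_LOCAL_DIRS=/tmp/spark-local-demo
export SPARK_WORKER_DIR=/tmp/spark-worker-demo
mkdir -p "$SPARK_LOCAL_DIRS" "$SPARK_WORKER_DIR"

# Simulate debris left behind by a SparkContext that died abruptly.
touch "$SPARK_LOCAL_DIRS/blockmgr-leftover" "$SPARK_WORKER_DIR/app-leftover"

# Before launching the next job, purge anything left over from the
# failed run so stale broadcast/block files cannot confuse the new
# context. ${var:?} aborts instead of expanding an unset variable.
rm -rf "${SPARK_LOCAL_DIRS:?}"/* "${SPARK_WORKER_DIR:?}"/*
```

Running such a purge only between jobs (never while a context is live) matches the commenter's workaround of cleaning the folders whenever a context fails to shut down cleanly.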
[jira] [Created] (SPARK-24081) Spark SQL drops the table while writing into table in "overwrite" mode.
Ashish created SPARK-24081: -- Summary: Spark SQL drops the table while writing into table in "overwrite" mode. Key: SPARK-24081 URL: https://issues.apache.org/jira/browse/SPARK-24081 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.3.0 Reporter: Ashish I read data from a table, modify it, and when I write it back to the table in overwrite mode, all the records are deleted. Expectation: the table should be updated with the modified data.
[jira] [Commented] (SPARK-24036) Stateful operators in continuous processing
[ https://issues.apache.org/jira/browse/SPARK-24036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451653#comment-16451653 ] Jungtaek Lim commented on SPARK-24036: -- Hello, I'm quite interested in this issue, since I just read the recent continuous-mode changes in the codebase and observed the same limitations. Do you have ideas or any design docs for this? Moreover, do you plan to share these tasks with the Spark community? I'm willing to contribute here, but it's completely OK if you plan to drive all the tasks yourself. > Stateful operators in continuous processing > --- > > Key: SPARK-24036 > URL: https://issues.apache.org/jira/browse/SPARK-24036 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > > The first iteration of continuous processing in Spark 2.3 does not work with > stateful operators.
[jira] [Created] (SPARK-24080) Update the nullability of Filter output based on inferred predicates
Takeshi Yamamuro created SPARK-24080: Summary: Update the nullability of Filter output based on inferred predicates Key: SPARK-24080 URL: https://issues.apache.org/jira/browse/SPARK-24080 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro In master, a logical `Filter` node does not respect the nullability change that the optimizer rule `InferFiltersFromConstraints` might introduce when inferred predicates include `IsNotNull`, e.g., {code} scala> val df = Seq((Some(1), Some(2))).toDF("a", "b") scala> val filteredDf = df.where("a = 3") scala> filteredDf.explain scala> filteredDf.queryExecution.optimizedPlan.children(0) res4: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Filter (isnotnull(_1#2) && (_1#2 = 3)) +- LocalRelation [_1#2, _2#3] scala> filteredDf.queryExecution.optimizedPlan.children(0).output.map(_.nullable) res5: Seq[Boolean] = List(true, true) {code} But these `nullable` values should be: {code} scala> filteredDf.queryExecution.optimizedPlan.children(0).output.map(_.nullable) res5: Seq[Boolean] = List(false, true) {code} This ticket comes from the previous discussion: https://github.com/apache/spark/pull/18576#pullrequestreview-107585997
[jira] [Commented] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451650#comment-16451650 ] Apache Spark commented on SPARK-24079: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21148 > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24079: Assignee: (was: Apache Spark) > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24079) Update the nullability of Join output based on inferred predicates
[ https://issues.apache.org/jira/browse/SPARK-24079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24079: Assignee: Apache Spark > Update the nullability of Join output based on inferred predicates > -- > > Key: SPARK-24079 > URL: https://issues.apache.org/jira/browse/SPARK-24079 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > In the master, a logical `Join` node does not respect the nullability that > the optimizer rule `InferFiltersFromConstraints` > might change when inferred predicates have `IsNotNull`, e.g., > {code} > scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") > scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") > scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") > scala> joinedDf.explain > == Physical Plan == > *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight > :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] > : +- *(2) Filter isnotnull(_1#80) > : +- LocalTableScan [_1#80, _2#81] > +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, > true] as bigint))) >+- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] > +- *(1) Filter isnotnull(_1#89) > +- LocalTableScan [_1#89, _2#90] > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(true, true, true, true) > {code} > But, these `nullable` values should be: > {code} > scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) > res15: Seq[Boolean] = List(false, true, false, true) > {code} > This ticket comes from the previous discussion: > https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24079) Update the nullability of Join output based on inferred predicates
Takeshi Yamamuro created SPARK-24079: Summary: Update the nullability of Join output based on inferred predicates Key: SPARK-24079 URL: https://issues.apache.org/jira/browse/SPARK-24079 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro In the master, a logical `Join` node does not respect the nullability that the optimizer rule `InferFiltersFromConstraints` might change when inferred predicates have `IsNotNull`, e.g., {code} scala> val df1 = Seq((Some(1), Some(2))).toDF("k", "v0") scala> val df2 = Seq((Some(1), Some(3))).toDF("k", "v1") scala> val joinedDf = df1.join(df2, df1("k") === df2("k"), "inner") scala> joinedDf.explain == Physical Plan == *(2) BroadcastHashJoin [k#83], [k#92], Inner, BuildRight :- *(2) Project [_1#80 AS k#83, _2#81 AS v0#84] : +- *(2) Filter isnotnull(_1#80) : +- LocalTableScan [_1#80, _2#81] +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) +- *(1) Project [_1#89 AS k#92, _2#90 AS v1#93] +- *(1) Filter isnotnull(_1#89) +- LocalTableScan [_1#89, _2#90] scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) res15: Seq[Boolean] = List(true, true, true, true) {code} But, these `nullable` values should be: {code} scala> joinedDf.queryExecution.optimizedPlan.output.map(_.nullable) res15: Seq[Boolean] = List(false, true, false, true) {code} This ticket comes from the previous discussion: https://github.com/apache/spark/pull/18576#pullrequestreview-107585997 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
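For the Join case the same idea applies per side: an inner equi-join lets the optimizer push `IsNotNull` onto both join keys, so the two key attributes in the join's output can become non-nullable while the value columns stay nullable. A Spark-independent sketch (the `Attr` and `joinOutputNullability` names are illustrative, not Catalyst APIs):

```scala
// Sketch: the join's output is left output ++ right output; attributes whose
// names appear in the inferred not-null set are marked non-nullable.
case class Attr(name: String, nullable: Boolean)

def joinOutputNullability(left: Seq[Attr], right: Seq[Attr], notNull: Set[String]): Seq[Attr] =
  (left ++ right).map(a => if (notNull(a.name)) a.copy(nullable = false) else a)

val left  = Seq(Attr("k", nullable = true), Attr("v0", nullable = true))
val right = Seq(Attr("k2", nullable = true), Attr("v1", nullable = true))
// Inner equi-join on k = k2: IsNotNull is inferred for both keys.
joinOutputNullability(left, right, Set("k", "k2")).map(_.nullable)
// List(false, true, false, true), matching the expected result above
```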
[jira] [Assigned] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24070: --- Assignee: Takeshi Yamamuro > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Takeshi Yamamuro >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangsongcheng updated SPARK-24078: --- Description: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balance4Single(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.cardIDList.map { cardID =>
    val tmpDataSet = dataSet.filter(col("card_id") === cardID)
    val sample = underSample(tmpDataSet, cardID)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample).distinct()
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

was: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categories.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

> reduce with unionAll takes a long time
> --
>
> Key: SPARK-24078
> URL: https://issues.apache.org/jira/browse/SPARK-24078
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.6.3
>Reporter: zhangsongcheng
>Priority: Major
>
> I try to sample the training set for each category, and then union all the
> samples together. This is my code:
> {code}
> def balance4Single(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.cardIDList.map { cardID =>
>     val tmpDataSet = dataSet.filter(col("card_id") === cardID)
>     val sample = underSample(tmpDataSet, cardID)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample).distinct()
> }
> {code}
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually cannot make progress at all. This
> confused me for a long time. Who can help me? Thank you!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
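The slowdown reported above is the classic symptom of left-folding {{unionAll}}: each union adds a node to the logical plan, so the fold builds a plan whose depth grows linearly with the number of samples, and Spark re-analyzes that ever-deeper plan at every step. A minimal, Spark-independent sketch of the shape of the problem and of a tree-shaped reduction (all names below are illustrative, not Spark APIs, and plan depth is modeled as a plain Int):

```scala
// Depth of the plan a left fold of binary unions produces: linear in n.
def foldDepth(n: Int): Int = n - 1

// Depth when unions are combined pairwise instead: logarithmic in n.
def treeDepth(n: Int): Int =
  if (n <= 1) 0 else 1 + treeDepth((n + 1) / 2)

// A tree-shaped reduce that could stand in for samples.reduce(_ unionAll _).
// Assumes xs is non-empty.
def treeReduce[T](xs: Seq[T])(op: (T, T) => T): T =
  if (xs.size == 1) xs.head
  else treeReduce(xs.grouped(2).map(g => g.reduce(op)).toSeq)(op)
```

In Spark itself the usual workarounds in this situation are combining the underlying RDDs with a single {{SparkContext.union}} and rebuilding the DataFrame, or periodically checkpointing, both of which keep the plan/lineage shallow.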
[jira] [Commented] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics
[ https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451609#comment-16451609 ] Apache Spark commented on SPARK-23799: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/21147 > [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of > empty table with analyzed statistics > > > Key: SPARK-23799 > URL: https://issues.apache.org/jira/browse/SPARK-23799 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.2.1, 2.3.0 >Reporter: Michael Shtelma >Assignee: Michael Shtelma >Priority: Major > Fix For: 2.4.0 > > > Spark 2.2.1 and 2.3.0 can produce a NumberFormatException (see below) during > the analysis of queries that use previously analyzed Hive tables. > The NumberFormatException occurs because in [FilterEstimation.scala on lines > 50 and > 52|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala?utf8=%E2%9C%93#L50-L52] > the method calculateFilterSelectivity returns NaN, which is caused by > division by zero. This leads to a NumberFormatException during the conversion > from Double to BigDecimal. > The NaN is caused by division by zero in the evaluateInSet method. 
> Exception: > java.lang.NumberFormatException > at java.math.BigDecimal.(BigDecimal.java:494) > at java.math.BigDecimal.(BigDecimal.java:824) > at scala.math.BigDecimal$.decimal(BigDecimal.scala:52) > at scala.math.BigDecimal$.decimal(BigDecimal.scala:55) > at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43) > at 
scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30) > at > org.apache.spark.sql.c
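The underlying fix amounts to guarding the zero-distinct-count case so the selectivity never becomes NaN in the first place. A Spark-independent sketch of that guard (`inSetSelectivity` is an illustrative name, not Spark's actual method signature):

```scala
// Sketch: when the column's distinct count is zero (an analyzed but empty
// table), an evaluateInSet-style selectivity must short-circuit instead of
// dividing by zero and yielding NaN, which later breaks the Double ->
// BigDecimal conversion seen in the stack trace above.
def inSetSelectivity(matchedNdv: Long, totalNdv: Long): Double =
  if (totalNdv == 0L) 0.0                        // empty table: nothing matches
  else matchedNdv.toDouble / totalNdv.toDouble   // safe: totalNdv > 0
```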
[jira] [Updated] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangsongcheng updated SPARK-24078: --- Description: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categories.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

was: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categorys.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

> reduce with unionAll takes a long time
> --
>
> Key: SPARK-24078
> URL: https://issues.apache.org/jira/browse/SPARK-24078
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.6.3
>Reporter: zhangsongcheng
>Priority: Major
>
> I try to sample the training set for each category, and then union all the
> samples together. This is my code:
> {code}
> def balanceCategory(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.categories.map { category =>
>     val tmpDataSet = dataSet.filter(col("category_id") === category)
>     val sample = underSample(tmpDataSet, category)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample)
> }
> {code}
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually cannot make progress at all. This
> confused me for a long time. Who can help me? Thank you!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24078) reduce with unionAll takes a long time
[ https://issues.apache.org/jira/browse/SPARK-24078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhangsongcheng updated SPARK-24078: --- Description: I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categorys.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you!

was: [the same report with the earlier, differently formatted revision of the {{balanceCategory}}/{{underSample}} code]

> reduce with unionAll takes a long time
> --
>
> Key: SPARK-24078
> URL: https://issues.apache.org/jira/browse/SPARK-24078
> Project: Spark
> Issue Type: Bug
> Components: Build
>Affects Versions: 1.6.3
>Reporter: zhangsongcheng
>Priority: Major
>
> I try to sample the training set for each category, and then union all the
> samples together. This is my code:
> {code}
> def balanceCategory(dataSet: DataFrame): DataFrame = {
>   val samples = LabelConf.categorys.map { category =>
>     val tmpDataSet = dataSet.filter(col("category_id") === category)
>     val sample = underSample(tmpDataSet, category)
>     sample
>   }
>   samples.reduce((x, y) => x.unionAll(y))
> }
>
> def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
>   val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
>   val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
>   positiveSample.unionAll(negativeSample)
> }
> {code}
> But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs
> more and more slowly, and eventually cannot make progress at all. This
> confused me for a long time. Who can help me? Thank you!
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24078) reduce with unionAll takes a long time
zhangsongcheng created SPARK-24078: -- Summary: reduce with unionAll takes a long time Key: SPARK-24078 URL: https://issues.apache.org/jira/browse/SPARK-24078 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.6.3 Reporter: zhangsongcheng I try to sample the training set for each category, and then union all the samples together. This is my code:
{code}
def balanceCategory(dataSet: DataFrame): DataFrame = {
  val samples = LabelConf.categorys.map { category =>
    val tmpDataSet = dataSet.filter(col("category_id") === category)
    val sample = underSample(tmpDataSet, category)
    sample
  }
  samples.reduce((x, y) => x.unionAll(y))
}

def underSample(dataSet: DataFrame, cardID: String): DataFrame = {
  val positiveSample = dataSet.filter(col("label") > 0.5).sample(false, 0.1)
  val negativeSample = dataSet.filter(col("label") < 0.5).sample(false, 0.1)
  positiveSample.unionAll(negativeSample)
}
{code}
But the code blocks at {{samples.reduce((x, y) => x.unionAll(y))}}: it runs more and more slowly, and eventually cannot make progress at all. This confused me for a long time. Who can help me? Thank you! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451540#comment-16451540 ] yucai commented on SPARK-24076: --- shuffle.partition = 8192 !p1.png! shuffle.partition = 8000 !p2.png! > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 on some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24076) very bad performance when shuffle.partition = 8192
[ https://issues.apache.org/jira/browse/SPARK-24076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yucai updated SPARK-24076: -- Attachment: p2.png p1.png > very bad performance when shuffle.partition = 8192 > -- > > Key: SPARK-24076 > URL: https://issues.apache.org/jira/browse/SPARK-24076 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: yucai >Priority: Major > Attachments: p1.png, p2.png > > > We see very bad performance when shuffle.partition = 8192 on some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24077) Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`?
Benedict Jin created SPARK-24077: Summary: Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? Key: SPARK-24077 URL: https://issues.apache.org/jira/browse/SPARK-24077 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.3.0 Reporter: Benedict Jin Fix For: 3.0.0 Why spark SQL not support `CREATE TEMPORARY FUNCTION IF NOT EXISTS`? scala> org.apache.spark.sql.SparkSession.builder().enableHiveSupport.getOrCreate.sql("CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan'") org.apache.spark.sql.catalyst.parser.ParseException: mismatched input 'NOT' expecting \{'.', 'AS'}(line 1, pos 29) == SQL == CREATE TEMPORARY FUNCTION IF NOT EXISTS yuzhouwan as 'org.apache.spark.sql.hive.udf.YuZhouWan' -^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592) ... 48 elided -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24076) very bad performance when shuffle.partition = 8192
yucai created SPARK-24076: - Summary: very bad performance when shuffle.partition = 8192 Key: SPARK-24076 URL: https://issues.apache.org/jira/browse/SPARK-24076 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: yucai We see very bad performance when shuffle.partition = 8192 on some cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23821) High-order function: flatten(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23821. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20938 [https://github.com/apache/spark/pull/20938] > High-order function: flatten(x) → array > --- > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Marek Novotny >Priority: Major > Fix For: 2.4.0 > > > Add the flatten function that transforms an Array-of-Arrays column into a > column of the array elements. If the array structure contains more than two > levels of nesting, the function removes only one nesting level. > Example: > {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
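The described semantics, removing exactly one nesting level, mirror Scala's own `Seq.flatten`, which makes for a quick sanity check of the example (`flattenOne` below is an illustrative plain-collections stand-in, not the SQL function itself):

```scala
// Sketch of the flatten semantics on plain Scala collections:
// only one level of nesting is removed.
def flattenOne[T](xs: Seq[Seq[T]]): Seq[T] = xs.flatten

val twoLevels = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
flattenOne(twoLevels)
// Seq(1, 2, 3, 4, 5, 6, 7, 8, 9)

val threeLevels: Seq[Seq[Seq[Int]]] = Seq(Seq(Seq(1, 2)), Seq(Seq(3)))
flattenOne(threeLevels)
// Seq(Seq(1, 2), Seq(3)) -- the innermost level is kept
```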
[jira] [Assigned] (SPARK-23821) High-order function: flatten(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23821: - Assignee: Marek Novotny > High-order function: flatten(x) → array > --- > > Key: SPARK-23821 > URL: https://issues.apache.org/jira/browse/SPARK-23821 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Marek Novotny >Priority: Major > > Add the flatten function that transforms an Array-of-Arrays column into a > column of the array elements. If the array structure contains more than two > levels of nesting, the function removes only one nesting level. > Example: > {{flatten(array(array(1, 2, 3), array(4, 5, 6), array(7, 8, 9))) => > [1,2,3,4,5,6,7,8,9]}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24075) [Mesos] Supervised driver upon failure will be retried indefinitely unless explicitly killed
Yogesh Natarajan created SPARK-24075: Summary: [Mesos] Supervised driver upon failure will be retried indefinitely unless explicitly killed Key: SPARK-24075 URL: https://issues.apache.org/jira/browse/SPARK-24075 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 2.3.0 Reporter: Yogesh Natarajan If supervise is enabled, MesosClusterScheduler will retry a failing driver indefinitely. This takes up cluster resources which is freed up only when the driver is explicitly killed. The proposed solution is to introduce spark configuration "spark.driver.supervise.maxRetries" which allows the maximum number of retries to be specified while preserving the default behavior of retrying the driver indefinitely. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
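The proposed cap could behave like the following sketch. This is hypothetical: the configuration name "spark.driver.supervise.maxRetries" and a default of -1 (meaning unlimited, preserving today's indefinite retries) come from the proposal, not from current Spark behaviour:

```scala
// Sketch of the relaunch decision MesosClusterScheduler would make for a
// failed supervised driver under the proposed setting.
//   maxRetries <  0 : retry indefinitely (assumed default, current behaviour)
//   maxRetries >= 0 : relaunch only while retriesSoFar is below the cap
def shouldRelaunch(retriesSoFar: Int, maxRetries: Int): Boolean =
  maxRetries < 0 || retriesSoFar < maxRetries
```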
[jira] [Commented] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451499#comment-16451499 ] Nadav Samet commented on SPARK-24074: - I was only able to reproduce this problem with this particular dependency, whether I request breeze, my own package that depends on breeze, or directly this package - it always fails to download it > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. 
The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Updated] (SPARK-24064) [Spark SQL] Create table using csv does not support binary column Type
[ https://issues.apache.org/jira/browse/SPARK-24064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24064: - Target Version/s: (was: 2.3.1) Please avoid setting a target version; that is usually done by a committer. Yeah, I know about this limitation. Do you have an idea of how to support this type? > [Spark SQL] Create table using csv does not support binary column Type > --- > > Key: SPARK-24064 > URL: https://issues.apache.org/jira/browse/SPARK-24064 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS Type: Suse 11 > Spark Version: 2.3.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > Labels: test > > # Launch spark-sql --master yarn > # create table csvTable (time timestamp, name string, isright boolean, > datetoday date, num binary, height double, score float, decimaler > decimal(10,0), id tinyint, age int, license bigint, length smallint) using > CSV options (path "/user/datatmo/customer1.csv"); > Result: Table creation is successful > 3. Select * from csvTable; > Throws the exception below: > ERROR SparkSQLDriver:91 - Failed in [select * from csvtable] > java.lang.UnsupportedOperationException: *CSV data source does not support > binary data type*. > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.org$apache$spark$sql$execution$datasources$csv$CSVUtils$$verifyType$1(CSVUtils.scala:127) > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131) > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$$anonfun$verifySchema$1.apply(CSVUtils.scala:131) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > > But a normal table supports the binary data type. 
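Since the CSV source rejects binary columns, the practical workaround is to encode the bytes as text before writing. The sketch below shows the idea with plain Python and the stdlib csv module; it is not code from the ticket, and in Spark SQL the same idea would presumably be applied with its base64/unbase64 functions around the CSV write and read:

```python
import base64
import csv
import io

# CSV cells can only hold text, so raw bytes must be encoded first.
raw = bytes([0x00, 0xFF, 0x10, 0x42])
encoded = base64.b64encode(raw).decode("ascii")

# Write and re-read a row through an in-memory CSV "file".
buf = io.StringIO()
csv.writer(buf).writerow(["row1", encoded])
buf.seek(0)
name, payload = next(csv.reader(buf))

# The original bytes survive the text round trip.
assert base64.b64decode(payload) == raw
print(name)  # row1
```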
[jira] [Commented] (SPARK-24068) CSV schema inferring doesn't work for compressed files
[ https://issues.apache.org/jira/browse/SPARK-24068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451483#comment-16451483 ] Hyukjin Kwon commented on SPARK-24068: -- Hm, [~maxgekk], btw is this specific to CSV (not, for example JSON)? > CSV schema inferring doesn't work for compressed files > -- > > Key: SPARK-24068 > URL: https://issues.apache.org/jira/browse/SPARK-24068 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Maxim Gekk >Priority: Major > > Here is a simple csv file compressed by lzo > {code} > $ cat ./test.csv > col1,col2 > a,1 > $ lzop ./test.csv > $ ls > test.csv test.csv.lzo > {code} > Reading test.csv.lzo with LZO codec (see > https://github.com/twitter/hadoop-lzo, for example): > {code:scala} > scala> val ds = spark.read.option("header", true).option("inferSchema", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") > ds: org.apache.spark.sql.DataFrame = [�LZO?: string] > scala> ds.printSchema > root > |-- �LZO: string (nullable = true) > scala> ds.show > +-+ > |�LZO| > +-+ > |a| > +-+ > {code} > but the file can be read if the schema is specified: > {code} > scala> import org.apache.spark.sql.types._ > scala> val schema = new StructType().add("col1", StringType).add("col2", > IntegerType) > scala> val ds = spark.read.schema(schema).option("header", > true).option("io.compression.codecs", > "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") > scala> ds.show > +++ > |col1|col2| > +++ > | a| 1| > +++ > {code} > Just in case, schema inferring works for the original uncompressed file: > {code:scala} > scala> spark.read.option("header", true).option("inferSchema", > true).csv("test.csv").printSchema > root > |-- col1: string (nullable = true) > |-- col2: integer (nullable = true) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
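The garbled column name in the report is consistent with schema inference reading the compressed bytes directly instead of running them through the codec. The stand-in below uses gzip from the Python stdlib (an LZO codec is not available there), so it only illustrates the symptom, not Spark's actual code path:

```python
import gzip

csv_text = "col1,col2\na,1\n"
compressed = gzip.compress(csv_text.encode("utf-8"))

# A reader that ignores the codec sees the container format, not the CSV:
# the first bytes are the gzip magic number, which inference would mistake
# for a garbled header line.
assert compressed[:2] == b"\x1f\x8b"

# Decompressing first recovers the real header, which is effectively what
# specifying the right codec (or an explicit schema) achieves.
header = gzip.decompress(compressed).decode("utf-8").splitlines()[0]
print(header)  # col1,col2
```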
[jira] [Updated] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-24074: - Priority: Major (was: Critical) Please avoid setting Critical+, which is usually reserved for committers. Does this consistently happen with other packages too? > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Major > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. 
The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451466#comment-16451466 ] Takeshi Yamamuro commented on SPARK-24070: -- ok > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24038. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21116 [https://github.com/apache/spark/pull/21116] > refactor continuous write exec to its own class > --- > > Key: SPARK-24038 > URL: https://issues.apache.org/jira/browse/SPARK-24038 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24038) refactor continuous write exec to its own class
[ https://issues.apache.org/jira/browse/SPARK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-24038: - Assignee: Jose Torres > refactor continuous write exec to its own class > --- > > Key: SPARK-24038 > URL: https://issues.apache.org/jira/browse/SPARK-24038 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Assignee: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451427#comment-16451427 ] Xiao Li commented on SPARK-24070: - Yeah, please do it here. Thanks! If you have the bandwidth to write the micro benchmark suite, that needs a separate PR. > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451404#comment-16451404 ] Takeshi Yamamuro commented on SPARK-24070: -- ok, this ticket means we will put the performance results here instead of pr? > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nadav Samet updated SPARK-24074: Environment: (was: {code:java} // code placeholder {code}) > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Nadav Samet >Priority: Critical > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. The dependency of breeze on f2j_arpack_combined seem fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
[ https://issues.apache.org/jira/browse/SPARK-24074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nadav Samet updated SPARK-24074: Description: {code:java} // code placeholder {code} >From some reason spark downloads a javadoc artifact of a package instead of >the jar. Steps to reproduce: # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and fetch artifacts from central: {code:java} rm -rf ~/.ivy2 {code} 1. Run: {code:java} ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages org.scalanlp:breeze_2.11:0.13.2{code} 2.Spark would download the javadoc instead of the jar: {code:java} downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar ... [SUCCESSFUL ] net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar (610ms){code} 3. Later spark would complain that it couldn't find the jar: {code:java} Warning: Local jar /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar does not exist, skipping. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).{code} 4. The dependency of breeze on f2j_arpack_combined seem fine: [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] was: {code:java} // code placeholder {code} >From some reason spark downloads a javadoc artifact of a package instead of >the jar. Steps to reproduce: # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and fetch artifacts from central: {code:java} rm -rf ~/.ivy2 {code} # Run: {code:java} ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages org.scalanlp:breeze_2.11:0.13.2{code} # Spark would download the javadoc instead of the jar: {code:java} downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar ... 
[SUCCESSFUL ] net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar (610ms){code} # Later spark would complain that it couldn't find the jar: {code:java} Warning: Local jar /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar does not exist, skipping. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).{code} # The dependency of breeze on f2j_arpack_combined seem fine: [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] > Maven package resolver downloads javadoc instead of jar > --- > > Key: SPARK-24074 > URL: https://issues.apache.org/jira/browse/SPARK-24074 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 > Environment: {code:java} > // code placeholder > {code} >Reporter: Nadav Samet >Priority: Critical > > {code:java} > // code placeholder > {code} > From some reason spark downloads a javadoc artifact of a package instead of > the jar. > Steps to reproduce: > # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and > fetch artifacts from central: > {code:java} > rm -rf ~/.ivy2 > {code} > 1. Run: > {code:java} > ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages > org.scalanlp:breeze_2.11:0.13.2{code} > 2.Spark would download the javadoc instead of the jar: > {code:java} > downloading > https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar > ... > [SUCCESSFUL ] > net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar > (610ms){code} > 3. Later spark would complain that it couldn't find the jar: > {code:java} > Warning: Local jar > /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar > does not exist, skipping. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel).{code} > 4. 
The dependency of breeze on f2j_arpack_combined seems fine: > [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] >
[jira] [Created] (SPARK-24074) Maven package resolver downloads javadoc instead of jar
Nadav Samet created SPARK-24074: --- Summary: Maven package resolver downloads javadoc instead of jar Key: SPARK-24074 URL: https://issues.apache.org/jira/browse/SPARK-24074 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Environment: {code:java} // code placeholder {code} Reporter: Nadav Samet {code:java} // code placeholder {code} >From some reason spark downloads a javadoc artifact of a package instead of >the jar. Steps to reproduce: # Delete (or move) your local ~/.ivy2 cache to force Spark to resolve and fetch artifacts from central: {code:java} rm -rf ~/.ivy2 {code} # Run: {code:java} ~/dev/spark-2.3.0-bin-hadoop2.7/bin/spark-shell --packages org.scalanlp:breeze_2.11:0.13.2{code} # Spark would download the javadoc instead of the jar: {code:java} downloading https://repo1.maven.org/maven2/net/sourceforge/f2j/arpack_combined_all/0.1/arpack_combined_all-0.1-javadoc.jar ... [SUCCESSFUL ] net.sourceforge.f2j#arpack_combined_all;0.1!arpack_combined_all.jar (610ms){code} # Later spark would complain that it couldn't find the jar: {code:java} Warning: Local jar /Users/thesamet/.ivy2/jars/net.sourceforge.f2j_arpack_combined_all-0.1.jar does not exist, skipping. Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).{code} # The dependency of breeze on f2j_arpack_combined seem fine: [http://central.maven.org/maven2/org/scalanlp/breeze_2.11/0.13.2/breeze_2.11-0.13.2.pom] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results
[ https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451263#comment-16451263 ] Henry Robinson commented on SPARK-23852: Yes it has - the Parquet community are going to do a 1.8.3 release, mostly just for us for this issue. Parquet 1.10 has already been released, and includes this fix. Upgrading Spark trunk to that version is the subject of SPARK-23972. > Parquet MR bug can lead to incorrect SQL results > > > Key: SPARK-23852 > URL: https://issues.apache.org/jira/browse/SPARK-23852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Blocker > Labels: correctness > > Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that > pushing certain predicates to Parquet scanners can return fewer results than > they should. > The bug triggers in Spark when: > * The Parquet file being scanner has stats for the null count, but not the > max or min on the column with the predicate (Apache Impala writes files like > this). > * The vectorized Parquet reader path is not taken, and the parquet-mr reader > is used. > * A suitable <, <=, > or >= predicate is pushed down to Parquet. > The bug is that the parquet-mr interprets the max and min of a row-group's > column as 0 in the absence of stats. So {{col > 0}} will filter all results, > even if some are > 0. > There is no upstream release of Parquet that contains the fix for > PARQUET-1217, although a 1.10 release is planned. > The least impactful workaround is to set the Parquet configuration > {{parquet.filter.stats.enabled}} to {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
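The failure mode described above can be condensed into a small pruning sketch. This is illustrative Python, not parquet-mr's actual code; it contrasts safe row-group pruning with the PARQUET-1217 behaviour of treating a missing max statistic as 0 for a pushed-down `col > threshold` predicate:

```python
def can_drop_row_group(stats, threshold):
    """Safe pruning for a `col > threshold` predicate: a row group may be
    skipped only when its max is known and provably <= threshold."""
    if stats.get("max") is None:  # no min/max stats: must read the group
        return False
    return stats["max"] <= threshold

def buggy_can_drop(stats, threshold):
    # The bug: a missing max is interpreted as 0, so `col > 0` wrongly
    # prunes row groups that do contain values greater than 0.
    return (stats.get("max") or 0) <= threshold

# Stats as Impala may write them: null count present, min/max absent.
stats = {"null_count": 3, "min": None, "max": None}
print(can_drop_row_group(stats, 0))  # False: the group must be scanned
print(buggy_can_drop(stats, 0))      # True: matching rows are silently lost
```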
[jira] [Resolved] (SPARK-24056) Make consumer creation lazy in Kafka source for Structured streaming
[ https://issues.apache.org/jira/browse/SPARK-24056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-24056. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 21134 [https://github.com/apache/spark/pull/21134] > Make consumer creation lazy in Kafka source for Structured streaming > > > Key: SPARK-24056 > URL: https://issues.apache.org/jira/browse/SPARK-24056 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Minor > Fix For: 3.0.0 > > > Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) > eagerly creates a consumer as soon as the Kafk aMicroBatchReader is created. > However, we create dummy KafkaMicroBatchReader to get the schema and > immediately stop it. Its better to make the consumer creation lazy, it will > be created on the first attempt to fetch offsets using the KafkaOffsetReader. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
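The change boils down to standard lazy initialization: delay creating the consumer until offsets are actually fetched, so a throwaway reader built only to get the schema never opens a connection. The Python sketch below illustrates the pattern with made-up names; it is not the actual KafkaMicroBatchReader/KafkaOffsetReader code:

```python
class LazyOffsetReader:
    """Creates its consumer on first use, not at construction time."""

    def __init__(self, consumer_factory):
        self._factory = consumer_factory
        self._consumer = None

    @property
    def consumer(self):
        if self._consumer is None:  # first offset fetch triggers creation
            self._consumer = self._factory()
        return self._consumer

    def fetch_latest_offsets(self):
        return self.consumer.position()

created = []

class FakeConsumer:
    def __init__(self):
        created.append(self)

    def position(self):
        return 42

reader = LazyOffsetReader(FakeConsumer)
print(len(created))                   # 0: constructing opens no connection
print(reader.fetch_latest_offsets())  # 42: consumer created on demand
print(len(created))                   # 1: and only once, even if reused
```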
[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16451241#comment-16451241 ] Marco Gaido commented on SPARK-24051: - [~hvanhovell] I am not sure that the analysis barriers are the root cause. The issue is that, because the very same list of cols is used in the two selects, the two different Aliases end up with the same exprId. I am not sure whether this case was supposed to be handled in ResolveReferences (or elsewhere) and broke with the introduction of AnalysisBarrier, or whether it was simply never supposed to be possible. > Incorrect results for certain queries using Java and Python APIs on Spark > 2.3.0 > --- > > Key: SPARK-24051 > URL: https://issues.apache.org/jira/browse/SPARK-24051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Priority: Major > > I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) > query, demonstrated by the Java program below. It was simplified from a much > more complex query, but I'm having trouble simplifying it further without > removing the erroneous behaviour. 
> {code:java} > package sparktest; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.*; > import org.apache.spark.sql.expressions.Window; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.Arrays; > public class Main { > public static void main(String[] args) { > SparkConf conf = new SparkConf() > .setAppName("SparkTest") > .setMaster("local[*]"); > SparkSession session = > SparkSession.builder().config(conf).getOrCreate(); > Row[] arr1 = new Row[]{ > RowFactory.create(1, 42), > RowFactory.create(2, 99)}; > StructType sch1 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty()), > new StructField("b", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1); > ds1.show(); > Row[] arr2 = new Row[]{ > RowFactory.create(3)}; > StructType sch2 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2) > .withColumn("b", functions.lit(0)); > ds2.show(); > Column[] cols = new Column[]{ > new Column("a"), > new Column("b").as("b"), > functions.count(functions.lit(1)) > .over(Window.partitionBy()) > .as("n")}; > Dataset ds = ds1 > .select(cols) > .union(ds2.select(cols)) > .where(new Column("n").geq(1)) > .drop("n"); > ds.show(); > //ds.explain(true); > } > } > {code} > It just calculates the union of 2 datasets, > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > +---+---+ > {code} > with > {code:java} > +---+---+ > | a| b| > +---+---+ > | 3| 0| > +---+---+ > {code} > The expected result is: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > | 3| 0| > +---+---+ > {code} > but instead it prints: > {code:java} > +---+---+ > | a| b| > 
+---+---+ > | 1| 0| > | 2| 0| > | 3| 0| > +---+---+ > {code} > Notice how the value in column b is always zero, overriding the original > values in rows 1 and 2. > Making seemingly trivial changes, such as replacing {{new > Column("b").as("b"),}} with just {{new Column("b"),}} or removing the > {{where}} clause after the union, makes it behave correctly again.
[jira] [Assigned] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-20114: - Assignee: Weichen Xu > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Weichen Xu >Priority: Major > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. 
The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y, which is more general and can work > better in practice for prediction. > I'd like to hear more from the users to see which kind of sequential rules > are more practical.
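The rule shape described above (X -> Y with X and Y each unordered, but all of X occurring before all of Y) can be made concrete with a small matcher. This is a hand-written sketch of those semantics for a single sequence, not an implementation of RuleGrowth or ERMiner themselves:

```python
def rule_matches(sequence, x_items, y_items):
    """True if every item of x_items can occur before every item of
    y_items in sequence; order inside x_items and y_items is ignored."""
    positions = {}
    for i, item in enumerate(sequence):
        positions.setdefault(item, []).append(i)
    if not all(it in positions for it in x_items | y_items):
        return False
    # Earliest point by which all of X has been seen at least once.
    cutoff = max(positions[it][0] for it in x_items)
    # Every Y item needs an occurrence strictly after that point.
    return all(any(p > cutoff for p in positions[it]) for it in y_items)

seq = ["a", "b", "c", "a", "d"]
print(rule_matches(seq, {"a", "b"}, {"c", "d"}))  # True: a,b all precede c,d
print(rule_matches(seq, {"c"}, {"b"}))            # False: b never follows c
```

A predictor built on such rules would fire X -> Y whenever X has been observed, regardless of the order of X's items, which is the noise tolerance the description argues for.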
[jira] [Commented] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450449#comment-16450449 ] Apache Spark commented on SPARK-23654: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/21146 > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superflous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23654: Assignee: Apache Spark > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superflous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour we're willing to maintain. > JetS3t was wonderful when it came out, but now the amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20114: -- Shepherd: Joseph K. Bradley > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Major > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. 
The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y; this is more general and can work > better in practice for prediction. > I'd like to hear more from users about which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
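The rule shape described above (an unordered antecedent X that must occur before an unordered consequent Y) is easy to illustrate without any mining library. The sketch below is a hypothetical helper written for this discussion, not Spark or SPMF code: it checks whether a single sequence supports a rule X -> Y by looking for a split point with all of X in the prefix and all of Y in the suffix.

```java
import java.util.*;

// Illustrative sketch (hypothetical helper, not a Spark API): a sequence
// supports the rule X -> Y when every item of X occurs in some prefix and
// every item of Y occurs in the remaining suffix; the items inside X and
// inside Y are themselves unordered.
public class SeqRuleDemo {
    public static boolean supports(List<String> seq, Set<String> x, Set<String> y) {
        for (int i = 1; i < seq.size(); i++) {
            Set<String> prefix = new HashSet<>(seq.subList(0, i));
            Set<String> suffix = new HashSet<>(seq.subList(i, seq.size()));
            if (prefix.containsAll(x) && suffix.containsAll(y)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> s = Arrays.asList("a", "c", "b", "d");
        // {a, b} -> {d}: a and b both occur (in any order) before d
        System.out.println(supports(s, Set.of("a", "b"), Set.of("d"))); // true
        // {d} -> {a}: d never precedes a in this sequence
        System.out.println(supports(s, Set.of("d"), Set.of("a"))); // false
    }
}
```

Counting how many sequences in a dataset support a rule this way gives the rule's support, which is what RuleGrowth/ERMiner-style algorithms compute efficiently.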
[jira] [Assigned] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23654: Assignee: (was: Apache Spark) > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft: > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superfluous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour of the S3A connector, which we're willing to maintain. > JetS3t was wonderful when it came out, but now the Amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-20114: -- Target Version/s: 2.4.0 > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Major > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited to direct use for prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add a > new column to the input DataSet. The PrefixSpanModel is only used to provide > access to frequent sequential patterns. > #* Add the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform, as FPGrowthModel does. > The rules extracted are of the form X -> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. > # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth and ERMiner. 
The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y; this is more general and can work > better in practice for prediction. > I'd like to hear more from users about which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
[ https://issues.apache.org/jira/browse/SPARK-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450443#comment-16450443 ] Apache Spark commented on SPARK-24073: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21145 > DataSourceV2: Rename DataReaderFactory back to ReadTask. > > > Key: SPARK-24073 > URL: https://issues.apache.org/jira/browse/SPARK-24073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The > intent was to make the read and write API match (write side uses > DataWriterFactory), but the underlying problem is that the two classes are > not equivalent. > ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to > a read task, in contrast to DataWriterFactory where the same factory instance > is used in all write tasks. ReadTask's purpose is to manage the lifecycle of > DataReader with an explicit create operation to mirror the close operation. > This is no longer clear from the API, where DataReaderFactory appears to be > more generic than it is, and it isn't clear why a set of them is produced for > a read. > We should rename DataReaderFactory back to ReadTask, which correctly conveys > the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
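The Iterable/Iterator analogy in the description can be sketched with hypothetical simplified interfaces (these are illustrative only, not Spark's actual DataSourceV2 API): a ReadTask behaves like an Iterable for one read task, owning the create/close lifecycle of its DataReader (the Iterator), which is why a factory name misleads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified types, not Spark's actual API, sketching the
// asymmetry: one ReadTask per partition creates its own DataReader and owns
// its lifecycle, with createDataReader() mirroring DataReader.close().
public class ReadTaskDemo {
    interface DataReader<T> extends AutoCloseable {
        boolean next();
        T get();
    }

    // One instance per read task, analogous to an Iterable over one partition.
    static class RangeReadTask {
        private final int start, end;
        RangeReadTask(int start, int end) { this.start = start; this.end = end; }

        DataReader<Integer> createDataReader() {
            return new DataReader<Integer>() {
                private int i = start - 1;
                public boolean next() { return ++i < end; }
                public Integer get() { return i; }
                public void close() { /* release task-scoped resources here */ }
            };
        }
    }

    // Explicit create ... close bracket around the read, as described above.
    public static List<Integer> readAll(RangeReadTask task) {
        List<Integer> out = new ArrayList<>();
        try (DataReader<Integer> r = task.createDataReader()) {
            while (r.next()) out.add(r.get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new RangeReadTask(0, 3))); // [0, 1, 2]
    }
}
```

A DataWriterFactory, by contrast, would be a single shared instance handed a partition id per task, which is why the same name does not fit the read side.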
[jira] [Assigned] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
[ https://issues.apache.org/jira/browse/SPARK-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24073: Assignee: Apache Spark > DataSourceV2: Rename DataReaderFactory back to ReadTask. > > > Key: SPARK-24073 > URL: https://issues.apache.org/jira/browse/SPARK-24073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > Fix For: 2.4.0 > > > Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The > intent was to make the read and write API match (write side uses > DataWriterFactory), but the underlying problem is that the two classes are > not equivalent. > ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to > a read task, in contrast to DataWriterFactory where the same factory instance > is used in all write tasks. ReadTask's purpose is to manage the lifecycle of > DataReader with an explicit create operation to mirror the close operation. > This is no longer clear from the API, where DataReaderFactory appears to be > more generic than it is, and it isn't clear why a set of them is produced for > a read. > We should rename DataReaderFactory back to ReadTask, which correctly conveys > the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
[ https://issues.apache.org/jira/browse/SPARK-24073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24073: Assignee: (was: Apache Spark) > DataSourceV2: Rename DataReaderFactory back to ReadTask. > > > Key: SPARK-24073 > URL: https://issues.apache.org/jira/browse/SPARK-24073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Fix For: 2.4.0 > > > Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The > intent was to make the read and write API match (write side uses > DataWriterFactory), but the underlying problem is that the two classes are > not equivalent. > ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to > a read task, in contrast to DataWriterFactory where the same factory instance > is used in all write tasks. ReadTask's purpose is to manage the lifecycle of > DataReader with an explicit create operation to mirror the close operation. > This is no longer clear from the API, where DataReaderFactory appears to be > more generic than it is, and it isn't clear why a set of them is produced for > a read. > We should rename DataReaderFactory back to ReadTask, which correctly conveys > the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core
[ https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-23654: --- Summary: Cut jets3t as a dependency of spark-core (was: Cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible) > Cut jets3t as a dependency of spark-core > > > Key: SPARK-23654 > URL: https://issues.apache.org/jira/browse/SPARK-23654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Spark core declares a dependency on Jets3t, which pulls in other cruft: > # the hadoop-cloud module pulls in the hadoop-aws module with the > jets3t-compatible connectors, and the relevant dependencies: the spark-core > dependency is incomplete if that module isn't built, and superfluous or > inconsistent if it is. > # We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop > 3.x in favour of the S3A connector, which we're willing to maintain. > JetS3t was wonderful when it came out, but now the Amazon SDKs massively > exceed it in functionality, albeit at the expense of week-to-week stability > and JAR binary compatibility. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24073) DataSourceV2: Rename DataReaderFactory back to ReadTask.
Ryan Blue created SPARK-24073: - Summary: DataSourceV2: Rename DataReaderFactory back to ReadTask. Key: SPARK-24073 URL: https://issues.apache.org/jira/browse/SPARK-24073 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Ryan Blue Fix For: 2.4.0 Just before 2.3.0, SPARK-23219 renamed ReadTask to DataReaderFactory. The intent was to make the read and write API match (write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent. ReadTask/DataReader function as Iterable/Iterator. ReadTask is specific to a read task, in contrast to DataWriterFactory where the same factory instance is used in all write tasks. ReadTask's purpose is to manage the lifecycle of DataReader with an explicit create operation to mirror the close operation. This is no longer clear from the API, where DataReaderFactory appears to be more generic than it is, and it isn't clear why a set of them is produced for a read. We should rename DataReaderFactory back to ReadTask, which correctly conveys the purpose and use of the class. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450426#comment-16450426 ] Apache Spark commented on SPARK-24043: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/21144 > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. 
> This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
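The initialize-before-eval contract in the report can be sketched with hypothetical classes (illustrative only, not Spark's Catalyst internals): a nondeterministic expression's per-partition random generator only exists once a task has started, so calling eval() before initialize() must fail.

```java
import java.util.Random;

// Simplified sketch (hypothetical classes, not Spark's Catalyst internals) of
// why a nondeterministic expression must be initialized before eval(): the
// per-partition random generator is created at task start, not at construction.
public class NondetDemo {
    static class RandExpr {
        private final long seed;
        private Random rng;  // null until initialize() runs for a partition

        RandExpr(long seed) { this.seed = seed; }

        void initialize(int partitionIndex) {
            rng = new Random(seed + partitionIndex);
        }

        double eval() {
            if (rng == null) {
                throw new IllegalStateException(
                    "Nondeterministic expression should be initialized before eval.");
            }
            return rng.nextDouble();
        }
    }

    public static void main(String[] args) {
        RandExpr e = new RandExpr(7L);
        try {
            e.eval();  // mirrors the reported failure: eval before initialize
        } catch (IllegalStateException ex) {
            System.out.println("caught: " + ex.getMessage());
        }
        e.initialize(0);
        System.out.println(e.eval() >= 0.0);  // succeeds after initialization
    }
}
```

The fix in the linked PR follows the same shape: the interpreted code path has to call the initialization step for every nondeterministic expression before evaluating the predicate.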
[jira] [Assigned] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24043: Assignee: Apache Spark > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. > This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24043: Assignee: (was: Apache Spark) > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. > This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24072) clearly define pushed filters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24072: Summary: clearly define pushed filters (was: remove unused DataSourceV2Relation.pushedFilters) > clearly define pushed filters > - > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23990. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21078 [https://github.com/apache/spark/pull/21078] > Instruments logging improvements - ML regression package > > > Key: SPARK-23990 > URL: https://issues.apache.org/jira/browse/SPARK-23990 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 > Environment: Instruments logging improvements - ML regression package >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > Original Estimate: 120h > Remaining Estimate: 120h > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0
[ https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450293#comment-16450293 ] Herman van Hovell commented on SPARK-24051: --- [~mgaido] do you have any idea why this is failing in Spark 2.3 specifically? Does it have something to do with the introduction of analysis barriers? > Incorrect results for certain queries using Java and Python APIs on Spark > 2.3.0 > --- > > Key: SPARK-24051 > URL: https://issues.apache.org/jira/browse/SPARK-24051 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Emlyn Corrin >Priority: Major > > I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) > query, demonstrated by the Java program below. It was simplified from a much > more complex query, but I'm having trouble simplifying it further without > removing the erroneous behaviour. > {code:java} > package sparktest; > import org.apache.spark.SparkConf; > import org.apache.spark.sql.*; > import org.apache.spark.sql.expressions.Window; > import org.apache.spark.sql.types.DataTypes; > import org.apache.spark.sql.types.Metadata; > import org.apache.spark.sql.types.StructField; > import org.apache.spark.sql.types.StructType; > import java.util.Arrays; > public class Main { > public static void main(String[] args) { > SparkConf conf = new SparkConf() > .setAppName("SparkTest") > .setMaster("local[*]"); > SparkSession session = > SparkSession.builder().config(conf).getOrCreate(); > Row[] arr1 = new Row[]{ > RowFactory.create(1, 42), > RowFactory.create(2, 99)}; > StructType sch1 = new StructType(new StructField[]{ > new StructField("a", DataTypes.IntegerType, true, > Metadata.empty()), > new StructField("b", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1); > ds1.show(); > Row[] arr2 = new Row[]{ > RowFactory.create(3)}; > StructType sch2 = new StructType(new StructField[]{ > 
new StructField("a", DataTypes.IntegerType, true, > Metadata.empty())}); > Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2) > .withColumn("b", functions.lit(0)); > ds2.show(); > Column[] cols = new Column[]{ > new Column("a"), > new Column("b").as("b"), > functions.count(functions.lit(1)) > .over(Window.partitionBy()) > .as("n")}; > Dataset<Row> ds = ds1 > .select(cols) > .union(ds2.select(cols)) > .where(new Column("n").geq(1)) > .drop("n"); > ds.show(); > //ds.explain(true); > } > } > {code} > It just calculates the union of 2 datasets, > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > +---+---+ > {code} > with > {code:java} > +---+---+ > | a| b| > +---+---+ > | 3| 0| > +---+---+ > {code} > The expected result is: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 42| > | 2| 99| > | 3| 0| > +---+---+ > {code} > but instead it prints: > {code:java} > +---+---+ > | a| b| > +---+---+ > | 1| 0| > | 2| 0| > | 3| 0| > +---+---+ > {code} > notice how the value in column b is always zero, overriding the original > values in rows 1 and 2. > Making seemingly trivial changes, such as replacing {{new > Column("b").as("b"),}} with just {{new Column("b"),}} or removing the > {{where}} clause after the union, makes it behave correctly again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23455. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20633 [https://github.com/apache/spark/pull/20633] > Default Params in ML should be saved separately > --- > > Key: SPARK-23455 > URL: https://issues.apache.org/jira/browse/SPARK-23455 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > We save ML's user-supplied params and default params as one entity in JSON. > When loading saved models, we set all the loaded params on the created ML > model instances as user-supplied params. > It causes some problems, e.g., if we strictly disallow some params to be set > at the same time, a default param can fail the param check because it is > treated as a user-supplied param after loading. > The loaded default params should not be set as user-supplied params. We > should save ML default params separately in JSON. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24072: Assignee: Wenchen Fan (was: Apache Spark) > remove unused DataSourceV2Relation.pushedFilters > > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24072: Assignee: Apache Spark (was: Wenchen Fan) > remove unused DataSourceV2Relation.pushedFilters > > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
[ https://issues.apache.org/jira/browse/SPARK-24072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450273#comment-16450273 ] Apache Spark commented on SPARK-24072: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/21143 > remove unused DataSourceV2Relation.pushedFilters > > > Key: SPARK-24072 > URL: https://issues.apache.org/jira/browse/SPARK-24072 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24072) remove unused DataSourceV2Relation.pushedFilters
Wenchen Fan created SPARK-24072: --- Summary: remove unused DataSourceV2Relation.pushedFilters Key: SPARK-24072 URL: https://issues.apache.org/jira/browse/SPARK-24072 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450262#comment-16450262 ] Kazuaki Ishizaki commented on SPARK-23933: -- cc [~smilegator] > High-order function: map(array, array) → map > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
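The Presto semantics quoted above can be mimicked in plain Java (illustrative only; this is not the proposed Spark implementation): zip a key array with a value array of the same length into a map.

```java
import java.util.*;

// Illustrative only: builds a map from parallel key/value lists, mirroring
// Presto's map(ARRAY[1,3], ARRAY[2,4]) -> {1 -> 2, 3 -> 4}.
public class MapFromArrays {
    public static <K, V> Map<K, V> mapFromArrays(List<K> keys, List<V> values) {
        if (keys.size() != values.size()) {
            throw new IllegalArgumentException(
                "key and value arrays must have the same length");
        }
        Map<K, V> m = new LinkedHashMap<>();  // preserves key order for display
        for (int i = 0; i < keys.size(); i++) {
            m.put(keys.get(i), values.get(i));
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(mapFromArrays(List.of(1, 3), List.of(2, 4))); // {1=2, 3=4}
    }
}
```

The SQL function would additionally have to define behavior for null entries and duplicate keys, which is where most of the design discussion tends to be.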
[jira] [Commented] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
[ https://issues.apache.org/jira/browse/SPARK-24070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450243#comment-16450243 ] Xiao Li commented on SPARK-24070: - cc [~maropu] > TPC-DS Performance Tests for Parquet 1.10.0 Upgrade > --- > > Key: SPARK-24070 > URL: https://issues.apache.org/jira/browse/SPARK-24070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24071) Micro-benchmark of Parquet Filter Pushdown
Xiao Li created SPARK-24071: --- Summary: Micro-benchmark of Parquet Filter Pushdown Key: SPARK-24071 URL: https://issues.apache.org/jira/browse/SPARK-24071 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li Need a micro-benchmark suite for Parquet filter pushdown -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24070) TPC-DS Performance Tests for Parquet 1.10.0 Upgrade
Xiao Li created SPARK-24070: --- Summary: TPC-DS Performance Tests for Parquet 1.10.0 Upgrade Key: SPARK-24070 URL: https://issues.apache.org/jira/browse/SPARK-24070 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li TPC-DS performance evaluation of Apache Spark Parquet 1.8.2 and 1.10.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23807. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20923 [https://github.com/apache/spark/pull/20923] > Add Hadoop 3 profile with relevant POM fix ups > -- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > Hadoop 3, and in particular Hadoop 3.1, adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supersedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for Spark. > * Lots of other features and fixes. > The basic work of building Spark with Hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously. 
> To use the cloud features, the Spark hadoop-3 profile should declare that the spark-hadoop-cloud module depends on (and only on) the hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud storage, plus a source package which is only built and tested when building against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23807) Add Hadoop 3 profile with relevant POM fix ups
[ https://issues.apache.org/jira/browse/SPARK-23807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23807: -- Assignee: Steve Loughran > Add Hadoop 3 profile with relevant POM fix ups > -- > > Key: SPARK-23807 > URL: https://issues.apache.org/jira/browse/SPARK-23807 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 2.4.0 > > > Hadoop 3, and particular Hadoop 3.1 adds: > * Java 8 as the minimum (and currently sole) supported Java version > * A new "hadoop-cloud-storage" module intended to be a minimal dependency > POM for all the cloud connectors in the version of hadoop built against > * The ability to declare a committer for any FileOutputFormat which > supercedes the classic FileOutputCommitter -in both a job and for a specific > FS URI > * A shaded client JAR, though not yet one complete enough for spark. > * Lots of other features and fixes. > The basic work of building spark with hadoop 3 is one of just doing the build > with {{-Dhadoop.version=3.x.y}}; however that > * Doesn't build on SBT (dependency resolution of zookeeper JAR) > * Misses the new cloud features > The ZK dependency can be fixed everywhere by explicitly declaring the ZK > artifact, instead of relying on curator to pull it in; this needs a profile > to declare the right ZK version, obviously.. > To use the cloud features spark the hadoop-3 profile should declare that the > spark-hadoop-cloud module depends on —and only on— the > hadoop/hadoop-cloud-storage module for its transitive dependencies on cloud > storage, and a source package which is only built and tested when build > against Hadoop 3.1+ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450186#comment-16450186 ] Julien Cuquemelle commented on SPARK-22683: --- Thanks for all your comments and proposals :) > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Assignee: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands,
[jira] [Resolved] (SPARK-24052) Support spark version showing on environment page
[ https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24052. --- Resolution: Not A Problem > Support spark version showing on environment page > - > > Key: SPARK-24052 > URL: https://issues.apache.org/jira/browse/SPARK-24052 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: zhoukang >Priority: Major > Attachments: environment-page.png > > > Since we may have multiple version in production cluster,we can showing some > information on environment page like below: > !environment-page.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450161#comment-16450161 ] Joseph K. Bradley commented on SPARK-23975: --- I merged https://github.com/apache/spark/pull/21081 for KMeans, and [~lu.DB] will follow up for the other algs. > Allow Clustering to take Arrays of Double as input features > --- > > Key: SPARK-23975 > URL: https://issues.apache.org/jira/browse/SPARK-23975 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Priority: Major > > Clustering algorithms should accept Arrays in addition to Vectors as input > features. The python interface should also be changed so that it would make > PySpark a lot easier to use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23975: - Assignee: Lu Wang > Allow Clustering to take Arrays of Double as input features > --- > > Key: SPARK-23975 > URL: https://issues.apache.org/jira/browse/SPARK-23975 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > > Clustering algorithms should accept Arrays in addition to Vectors as input > features. The python interface should also be changed so that it would make > PySpark a lot easier to use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24069) Add array_max / array_min functions
[ https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24069: Assignee: (was: Apache Spark) > Add array_max / array_min functions > --- > > Key: SPARK-24069 > URL: https://issues.apache.org/jira/browse/SPARK-24069 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24069) Add array_max / array_min functions
[ https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450142#comment-16450142 ] Apache Spark commented on SPARK-24069: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/21142 > Add array_max / array_min functions > --- > > Key: SPARK-24069 > URL: https://issues.apache.org/jira/browse/SPARK-24069 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24069) Add array_max / array_min functions
[ https://issues.apache.org/jira/browse/SPARK-24069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24069: Assignee: Apache Spark > Add array_max / array_min functions > --- > > Key: SPARK-24069 > URL: https://issues.apache.org/jira/browse/SPARK-24069 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24069) Add array_max / array_min functions
Hyukjin Kwon created SPARK-24069: Summary: Add array_max / array_min functions Key: SPARK-24069 URL: https://issues.apache.org/jira/browse/SPARK-24069 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.4.0 Reporter: Hyukjin Kwon Add R versions of SPARK-23918 and SPARK-23917 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-22683: - Assignee: Julien Cuquemelle > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Assignee: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450131#comment-16450131 ] Thomas Graves commented on SPARK-22683: --- Note this added a new config spark.dynamicAllocation.executorAllocationRatio, default to 1.0 which is the same behavior as existing releases. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubs
[jira] [Resolved] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-22683. --- Resolution: Fixed Fix Version/s: 2.4.0 > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. 
> - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
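The sizing rule proposed above (each task slot expected to run N tasks, instead of one slot per task) can be sketched in a few lines. This is an illustrative Python model, not Spark's actual ExecutorAllocationManager code; note that the config ultimately merged, spark.dynamicAllocation.executorAllocationRatio, expresses the same idea as a ratio multiplier rather than a tasks-per-slot divisor.

```python
import math

def target_executors(pending_tasks, executor_cores, task_cpus, tasks_per_slot=1):
    """Illustrative model of the proposal: size the executor target so each
    task slot is expected to run `tasks_per_slot` tasks, instead of one."""
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 1000 pending tasks, spark.executor.cores=5, spark.task.cpus=1:
assert target_executors(1000, 5, 1) == 200                   # default: 1 task per slot
assert target_executors(1000, 5, 1, tasks_per_slot=6) == 34  # ~6x fewer containers
```

With tasks_per_slot=6 a job trades some latency (tasks queue behind each other in a slot) for far fewer allocated-then-idle containers, which is where the reported ~30% vcore-hour reduction comes from.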
[jira] [Commented] (SPARK-24000) S3A: Create Table should fail on invalid AK/SK
[ https://issues.apache.org/jira/browse/SPARK-24000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450024#comment-16450024 ] Steve Loughran commented on SPARK-24000: We could consider whether or not to raise an AccessDeniedException in S3A in this situation, so we fail fast if you can't read the bucket. Interesting issue: you'd need to extend the current assumed-role tests to see if there's a risk that you could be granted access to paths under a bucket yet still have this test fail with 403. # Created HADOOP-15409 about S3A at least failing fast on invalid credentials # Is that enough? If Spark adds a getFileStatus() call on the path and catches all FNFEs, then it would handle the auth problems, but would also include other failure modes, e.g. network connectivity. Good or bad?
> S3A: Create Table should fail on invalid AK/SK
> --
>
> Key: SPARK-24000
> URL: https://issues.apache.org/jira/browse/SPARK-24000
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell
> Affects Versions: 2.3.0
> Reporter: Brahma Reddy Battula
> Priority: Major
>
> Currently, when we pass an *invalid AK/SK*, *create table* still *succeeds*.
> When the S3AFileSystem is initialized, *verifyBucketExists*() is called, which returns *true* because status code 403 (*BUCKET_ACCESS_FORBIDDEN_STATUS_CODE*) is treated as "bucket exists" in the following:
> {code:java}
> public boolean doesBucketExist(String bucketName)
>     throws AmazonClientException, AmazonServiceException {
>   try {
>     headBucket(new HeadBucketRequest(bucketName));
>     return true;
>   } catch (AmazonServiceException ase) {
>     // A redirect error or a forbidden error means the bucket exists. So
>     // returning true.
>     if ((ase.getStatusCode() == Constants.BUCKET_REDIRECT_STATUS_CODE)
>         || (ase.getStatusCode() == Constants.BUCKET_ACCESS_FORBIDDEN_STATUS_CODE)) {
>       return true;
>     }
>     if (ase.getStatusCode() == Constants.NO_SUCH_BUCKET_STATUS_CODE) {
>       return false;
>     }
>     throw ase;
>   }
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
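For readers skimming the SDK snippet above: the reason an invalid AK/SK slips through is that verifyBucketExists() treats a 403 the same as success. A minimal Python distillation of that decision table (status-code constants hard-coded here for illustration; the real values live in the AWS SDK's Constants class):

```python
BUCKET_REDIRECT_STATUS_CODE = 301
BUCKET_ACCESS_FORBIDDEN_STATUS_CODE = 403
NO_SUCH_BUCKET_STATUS_CODE = 404

def does_bucket_exist(status_code):
    """Mirrors the quoted doesBucketExist() logic: a redirect or a forbidden
    response still counts as "the bucket exists", so bad credentials (403)
    pass the existence check instead of failing fast."""
    if status_code in (BUCKET_REDIRECT_STATUS_CODE, BUCKET_ACCESS_FORBIDDEN_STATUS_CODE):
        return True
    if status_code == NO_SUCH_BUCKET_STATUS_CODE:
        return False
    raise RuntimeError(f"unexpected HEAD bucket status: {status_code}")

assert does_bucket_exist(403) is True   # the problem: forbidden == "exists"
assert does_bucket_exist(404) is False
```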
[jira] [Commented] (SPARK-23852) Parquet MR bug can lead to incorrect SQL results
[ https://issues.apache.org/jira/browse/SPARK-23852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450018#comment-16450018 ] Eric Maynard commented on SPARK-23852: -- >There is no upstream release of Parquet that contains the fix for PARQUET-1217, although a 1.10 release is planned. PARQUET-1217 seems to have been merged into Parquet 1.8.3 today. > Parquet MR bug can lead to incorrect SQL results > > > Key: SPARK-23852 > URL: https://issues.apache.org/jira/browse/SPARK-23852 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Henry Robinson >Priority: Blocker > Labels: correctness > > Parquet MR 1.9.0 and 1.8.2 both have a bug, PARQUET-1217, that means that > pushing certain predicates to Parquet scanners can return fewer results than > they should. > The bug triggers in Spark when: > * The Parquet file being scanned has stats for the null count, but not the > max or min, on the column with the predicate (Apache Impala writes files like > this). > * The vectorized Parquet reader path is not taken, and the parquet-mr reader > is used. > * A suitable <, <=, > or >= predicate is pushed down to Parquet. > The bug is that parquet-mr interprets the max and min of a row-group's > column as 0 in the absence of stats. So {{col > 0}} will filter out all results, > even if some are > 0. > There is no upstream release of Parquet that contains the fix for > PARQUET-1217, although a 1.10 release is planned. > The least impactful workaround is to set the Parquet configuration > {{parquet.filter.stats.enabled}} to {{false}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
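The pruning mistake described in the report can be distilled into a toy model (hypothetical code, not parquet-mr's actual row-group filter): when a row group has a null count but no min/max, a reader that defaults the missing max to 0 drops row groups that a correct reader must keep.

```python
def can_drop_buggy(stats_max, lower_bound):
    """Row-group pruning for a `col > lower_bound` predicate, with the
    PARQUET-1217 flaw: a missing max silently becomes 0."""
    effective_max = stats_max if stats_max is not None else 0  # the bug
    return effective_max <= lower_bound

def can_drop_fixed(stats_max, lower_bound):
    """Correct behaviour: without min/max stats, the group cannot be pruned."""
    return stats_max is not None and stats_max <= lower_bound

# Row group actually containing values up to 42, but written (e.g. by Impala)
# with only a null count and no min/max; the pushed-down predicate is `col > 0`:
assert can_drop_buggy(None, 0) is True    # wrongly pruned -> rows go missing
assert can_drop_fixed(None, 0) is False   # kept, as it must be
```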
[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449948#comment-16449948 ] Eric Maynard commented on SPARK-23519: -- Why does the fact that you dynamically generate the statement mean that you can't alias the columns in your select statement? You can generate aliases as well. This seems like a non-issue. > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Critical > > 1. Create and populate a Hive table. I did this in a Hive CLI session [not that this matters] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. Create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. 
> at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
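To make the comment's suggestion concrete: since the statement is generated dynamically, the alias list can be generated in the same pass, so the SELECT list never contains duplicate output names. A minimal sketch in Python (the `aliased_select` helper is hypothetical, not a Spark API; table and view names are taken from the report):

```python
def aliased_select(columns, aliases):
    """Pair possibly-duplicate source columns with unique aliases so the
    view's output schema has no duplicate column names."""
    if len(columns) != len(aliases):
        raise ValueError("need exactly one alias per column")
    return ", ".join(f"{c} AS {a}" for c, a in zip(columns, aliases))

select_list = aliased_select(["col1", "col1"], ["int1", "int2"])
# select_list == "col1 AS int1, col1 AS int2"
ddl = f"CREATE VIEW default.aview AS SELECT {select_list} FROM atable"
```

The resulting DDL string can then be passed to spark.sql(...) without tripping the duplicate-column assertion, because each duplicate source column is selected under a unique alias.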
[jira] [Comment Edited] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449935#comment-16449935 ] Li Jin edited comment on SPARK-22947 at 4/24/18 2:16 PM: - I came across this blog today: [https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html] And realized the Ad Monetization example in the blog pretty much described the asof join case in streaming mode. was (Author: icexelloss): I came across this blog today: [https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html] And realized the Ad Monetization problem pretty much described asof join case in streaming mode. > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin >Priority: Major > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > libraries have implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. 
Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. > * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... 
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.aso
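The as-of join semantics of Use Case A above can be illustrated with a small self-contained sketch over in-memory lists (plain Python for illustration; the `asof_join` helper is hypothetical and is not the proposed Spark API):

```python
from datetime import date, timedelta

def asof_join(left, right, lookback_days):
    """For each (time, quantity) row on the left, attach the value of the most
    recent right row whose time is <= the left time and within the lookback
    window; None if no such row exists. Toy model of an as-of join."""
    out = []
    for t, qty in left:
        candidates = [(rt, p) for rt, p in right
                      if rt <= t and (t - rt) <= timedelta(days=lookback_days)]
        # max() on (date, price) tuples picks the latest matching right row
        price = max(candidates)[1] if candidates else None
        out.append((t, qty, price))
    return out

df_a = [(date(2016, 1, 1), 100), (date(2016, 1, 2), 50),
        (date(2016, 1, 4), -50), (date(2016, 1, 5), 100)]
df_b = [(date(2015, 12, 31), 100.0), (date(2016, 1, 4), 105.0),
        (date(2016, 1, 5), 102.0)]

result = asof_join(df_a, df_b, lookback_days=1)
# joined prices: [100.0, None, 105.0, 102.0], matching the example output above
```

Note how the 20160101 row picks up the 20151231 price through the one-day lookback, which is exactly the "expanded" time context the proposal describes.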
[jira] [Commented] (SPARK-22947) SPIP: as-of join in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-22947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449935#comment-16449935 ] Li Jin commented on SPARK-22947: I came across this blog today: [https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html] And realized the Ad Monetization problem pretty much described asof join case in streaming mode. > SPIP: as-of join in Spark SQL > - > > Key: SPARK-22947 > URL: https://issues.apache.org/jira/browse/SPARK-22947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Li Jin >Priority: Major > Attachments: SPIP_ as-of join in Spark SQL (1).pdf > > > h2. Background and Motivation > Time series analysis is one of the most common analysis on financial data. In > time series analysis, as-of join is a very common operation. Supporting as-of > join in Spark SQL will allow many use cases of using Spark SQL for time > series analysis. > As-of join is “join on time” with inexact time matching criteria. Various > library has implemented asof join or similar functionality: > Kdb: https://code.kx.com/wiki/Reference/aj > Pandas: > http://pandas.pydata.org/pandas-docs/version/0.19.0/merging.html#merging-merge-asof > R: This functionality is called “Last Observation Carried Forward” > https://www.rdocumentation.org/packages/zoo/versions/1.8-0/topics/na.locf > JuliaDB: http://juliadb.org/latest/api/joins.html#IndexedTables.asofjoin > Flint: https://github.com/twosigma/flint#temporal-join-functions > This proposal advocates introducing new API in Spark SQL to support as-of > join. > h2. Target Personas > Data scientists, data engineers > h2. Goals > * New API in Spark SQL that allows as-of join > * As-of join of multiple table (>2) should be performant, because it’s very > common that users need to join multiple data sources together for further > analysis. 
> * Define Distribution, Partitioning and shuffle strategy for ordered time > series data > h2. Non-Goals > These are out of scope for the existing SPIP, should be considered in future > SPIP as improvement to Spark’s time series analysis ability: > * Utilize partition information from data source, i.e, begin/end of each > partition to reduce sorting/shuffling > * Define API for user to implement asof join time spec in business calendar > (i.e. lookback one business day, this is very common in financial data > analysis because of market calendars) > * Support broadcast join > h2. Proposed API Changes > h3. TimeContext > TimeContext is an object that defines the time scope of the analysis, it has > begin time (inclusive) and end time (exclusive). User should be able to > change the time scope of the analysis (i.e, from one month to five year) by > just changing the TimeContext. > To Spark engine, TimeContext is a hint that: > can be used to repartition data for join > serve as a predicate that can be pushed down to storage layer > Time context is similar to filtering time by begin/end, the main difference > is that time context can be expanded based on the operation taken (see > example in as-of join). > Time context example: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > {code} > h3. asofJoin > h4. User Case A (join without key) > Join two DataFrames on time, with one day lookback: > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... 
> JoinSpec joinSpec = JoinSpec(timeContext).on("time").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, quantity > 20160101, 100 > 20160102, 50 > 20160104, -50 > 20160105, 100 > dfB: > time, price > 20151231, 100.0 > 20160104, 105.0 > 20160105, 102.0 > output: > time, quantity, price > 20160101, 100, 100.0 > 20160102, 50, null > 20160104, -50, 105.0 > 20160105, 100, 102.0 > {code} > Note row (20160101, 100) of dfA is joined with (20151231, 100.0) of dfB. This > is an important illustration of the time context - it is able to expand the > context to 20151231 on dfB because of the 1 day lookback. > h4. Use Case B (join with key) > To join on time and another key (for instance, id), we use “by” to specify > the key. > {code:java} > TimeContext timeContext = TimeContext("20160101", "20170101") > dfA = ... > dfB = ... > JoinSpec joinSpec = > JoinSpec(timeContext).on("time").by("id").tolerance("-1day") > result = dfA.asofJoin(dfB, joinSpec) > {code} > Example input/output: > {code:java} > dfA: > time, id, quantity > 20160101, 1, 100 > 20160101, 2, 50 > 20160102, 1, -50 > 20160102, 2, 50 > dfB: > time, id, price > 20151231, 1, 100.0 > 20150102, 1, 105.0 > 20150102, 2, 195.0 > Output: > time, id, quantity, price > 20160101, 1, 100, 100.0 >
[jira] [Created] (SPARK-24068) CSV schema inferring doesn't work for compressed files
Maxim Gekk created SPARK-24068: -- Summary: CSV schema inferring doesn't work for compressed files Key: SPARK-24068 URL: https://issues.apache.org/jira/browse/SPARK-24068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Maxim Gekk Here is a simple csv file compressed by lzo {code} $ cat ./test.csv col1,col2 a,1 $ lzop ./test.csv $ ls test.csv test.csv.lzo {code} Reading test.csv.lzo with LZO codec (see https://github.com/twitter/hadoop-lzo, for example): {code:scala} scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo") ds: org.apache.spark.sql.DataFrame = [�LZO?: string] scala> ds.printSchema root |-- �LZO: string (nullable = true) scala> ds.show +-+ |�LZO| +-+ |a| +-+ {code} but the file can be read if the schema is specified: {code} scala> import org.apache.spark.sql.types._ scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType) scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo") scala> ds.show +++ |col1|col2| +++ | a| 1| +++ {code} Just in case, schema inferring works for the original uncompressed file: {code:scala} scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema root |-- col1: string (nullable = true) |-- col2: integer (nullable = true) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23182) Allow enabling of TCP keep alive for master RPC connections
[ https://issues.apache.org/jira/browse/SPARK-23182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petar Petrov updated SPARK-23182: - Affects Version/s: 2.2.2 > Allow enabling of TCP keep alive for master RPC connections > --- > > Key: SPARK-23182 > URL: https://issues.apache.org/jira/browse/SPARK-23182 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.2, 2.4.0 >Reporter: Petar Petrov >Priority: Major > > We rely heavily on preemptible worker machines in GCP/GCE. These machines > disappear without closing the TCP connections to the master which increases > the number of established connections and new workers can not connect because > of "Too many open files" on the master. > To solve the problem we need to enable TCP keep alive for the RPC connections > to the master but it's not possible to do so via configuration. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
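For context, the "TCP keep alive" requested above is the standard socket option: with it enabled, the OS periodically probes idle peers and eventually tears down connections to machines that vanished without a FIN (such as preempted VMs). A minimal sketch of what enabling it looks like at the socket level (plain Python; Spark's RPC layer is Netty-based, where the equivalent would be the SO_KEEPALIVE channel option):

```python
import socket

# Create a TCP socket and enable OS-level keep-alive probes on it.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
keepalive_enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
s.close()
```

Probe interval and retry counts are governed by OS-level settings (e.g. net.ipv4.tcp_keepalive_* on Linux), which is why exposing only the on/off flag through Spark configuration would already be useful here.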
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449857#comment-16449857 ] Steve Loughran commented on SPARK-18673: It's a big hive patch, but most of it is hbase related. I'm going to offer to help cherrypick the relevant hive changes to the org.sparkproject hive JAR, but I'm going to place a dependency on SPARK-23807 first, so that you can actually build against hadoop-3 properly in both maven and SBT. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-24067: --- Description: SPARK-17147 fixes a problem with non-consecutive Kafka offsets. The [PR|https://github.com/apache/spark/pull/20572] was merged to {{master}}. This should be backported to 2.3. Original Description from SPARK-17147 : When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset. I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from: {{nextOffset = offset + 1}} to: {{nextOffset = record.offset + 1}} and changed KafkaRDD from: {{requestOffset += 1}} to: {{requestOffset = r.offset() + 1}} (I also had to change some assert logic in CachedKafkaConsumer). There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction) I am also happy to contribute a PR. was: Spark 2.3 Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction) When Kafka does log compaction, offsets often end up with gaps, meaning the next requested offset will frequently not be offset+1. The logic in KafkaRDD & CachedKafkaConsumer has a baked-in assumption that the next offset will always be just an increment of 1 above the previous offset. 
I have worked around this problem by changing CachedKafkaConsumer to use the returned record's offset, from: {{nextOffset = offset + 1}} to: {{nextOffset = record.offset + 1}} and changed KafkaRDD from: {{requestOffset += 1}} to: {{requestOffset = r.offset() + 1}} (I also had to change some assert logic in CachedKafkaConsumer). There's a strong possibility that I have misconstrued how to use the streaming kafka consumer, and I'm happy to close this out if that's the case. If, however, it is supposed to support non-consecutive offsets (e.g. due to log compaction) I am also happy to contribute a PR. > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > SPARK-17147 fixes a problem with non-consecutive Kafka Offsets. The [PR > w|https://github.com/apache/spark/pull/20572]as merged to {{master}}. This > should be backported to 2.3. > > Original Description from SPARK-17147 : > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. 
> If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
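The gap-tolerant offset advance described above can be sketched with a toy model of a compacted partition (plain Python with hypothetical names; this is an illustration of the idea, not the actual CachedKafkaConsumer code):

```python
# Toy model of a compacted Kafka partition: offsets 2, 3, 5 and 6 were
# compacted away, so records are NOT at consecutive offsets.
log = [(0, "a"), (1, "b"), (4, "c"), (7, "d")]  # (offset, value)

def read_all(records):
    """Advance the expected offset from the record actually returned
    (the SPARK-17147 fix), so gaps left by compaction are tolerated."""
    request_offset = 0
    values = []
    for offset, value in records:
        # The old logic effectively asserted offset == request_offset and
        # bumped by 1, which breaks at the first gap; >= tolerates gaps.
        assert offset >= request_offset
        values.append(value)
        request_offset = offset + 1  # next fetch position from the record itself
    return values
```

With the old `request_offset += 1` style advance, the consumer would expect offset 2 after reading offset 1 and fail at the gap; deriving the next position from the returned record reads the whole compacted log cleanly.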
[jira] [Updated] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-24067: --- Affects Version/s: (was: 2.0.0) 2.3.0 > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > Spark 2.3 Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24067) Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets (i.e. Log Compaction))
[ https://issues.apache.org/jira/browse/SPARK-24067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joachim Hereth updated SPARK-24067: --- Fix Version/s: (was: 2.4.0) > Backport SPARK-17147 to 2.3 (Spark Streaming Kafka 0.10 Consumer Can't Handle > Non-consecutive Offsets (i.e. Log Compaction)) > > > Key: SPARK-24067 > URL: https://issues.apache.org/jira/browse/SPARK-24067 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.3.0 >Reporter: Joachim Hereth >Assignee: Cody Koeninger >Priority: Major > > Spark 2.3 Streaming Kafka 0.10 Consumer Can't Handle Non-consecutive Offsets > (i.e. Log Compaction) > > When Kafka does log compaction offsets often end up with gaps, meaning the > next requested offset will be frequently not be offset+1. The logic in > KafkaRDD & CachedKafkaConsumer has a baked in assumption that the next offset > will always be just an increment of 1 above the previous offset. > I have worked around this problem by changing CachedKafkaConsumer to use the > returned record's offset, from: > {{nextOffset = offset + 1}} > to: > {{nextOffset = record.offset + 1}} > and changed KafkaRDD from: > {{requestOffset += 1}} > to: > {{requestOffset = r.offset() + 1}} > (I also had to change some assert logic in CachedKafkaConsumer). > There's a strong possibility that I have misconstrued how to use the > streaming kafka consumer, and I'm happy to close this out if that's the case. > If, however, it is supposed to support non-consecutive offsets (e.g. due to > log compaction) I am also happy to contribute a PR. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org