[jira] [Created] (SPARK-12904) Strength reduction for integer/decimal comparisons

2016-01-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12904:
---

 Summary: Strength reduction for integer/decimal comparisons
 Key: SPARK-12904
 URL: https://issues.apache.org/jira/browse/SPARK-12904
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, SQL
Reporter: Reynold Xin


We can do the following strength reduction for comparisons between an integral 
column and a decimal literal:

1. int_col > decimal_literal => int_col > floor(decimal_literal)

2. int_col >= decimal_literal => int_col > ceil(decimal_literal)

3. int_col < decimal_literal => int_col < floor(decimal_literal)

4. int_col <= decimal_literal => int_col < ceil(decimal_literal)


This becomes more useful once we start parsing floating-point numeric literals 
as decimals rather than doubles (SPARK-12848).
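
As an illustration, here is a small, self-contained Scala sketch of this kind of rewrite. It does not use Spark's Catalyst API; the names (StrengthReductionSketch, CmpOp, rewrite) are hypothetical and exist only for this example, and the comparison forms are chosen so that each rewrite stays exact even when the decimal literal holds an integral value.

{code}
// Toy strength-reduction sketch (NOT Spark's Catalyst rule): rewrite
// `int_col <op> decimalLiteral` into an equivalent comparison against an
// integer threshold. All names here are hypothetical, for illustration only.
object StrengthReductionSketch {
  sealed trait CmpOp
  case object Gt extends CmpOp // >
  case object Ge extends CmpOp // >=
  case object Lt extends CmpOp // <
  case object Le extends CmpOp // <=

  /** Equivalent (operator, integer threshold) for `x <op> d`, where x is integral. */
  def rewrite(op: CmpOp, d: BigDecimal): (CmpOp, Long) = {
    val floor = d.setScale(0, BigDecimal.RoundingMode.FLOOR).toLong
    val ceil  = d.setScale(0, BigDecimal.RoundingMode.CEILING).toLong
    op match {
      case Gt => (Gt, floor) // x > 1.5  <=> x > 1;   x > 2.0  <=> x > 2
      case Ge => (Ge, ceil)  // x >= 1.5 <=> x >= 2;  x >= 2.0 <=> x >= 2
      case Lt => (Lt, ceil)  // x < 1.5  <=> x < 2;   x < 2.0  <=> x < 2
      case Le => (Le, floor) // x <= 1.5 <=> x <= 1;  x <= 2.0 <=> x <= 2
    }
  }

  private def eval(op: CmpOp, x: BigDecimal, rhs: BigDecimal): Boolean = op match {
    case Gt => x > rhs
    case Ge => x >= rhs
    case Lt => x < rhs
    case Le => x <= rhs
  }

  def main(args: Array[String]): Unit = {
    // Brute-force check that each rewrite preserves the original comparison.
    for {
      d  <- Seq(BigDecimal("1.5"), BigDecimal("2.0"), BigDecimal("-0.3"))
      op <- Seq(Gt, Ge, Lt, Le)
      x  <- -5L to 5L
    } {
      val (newOp, threshold) = rewrite(op, d)
      assert(eval(op, BigDecimal(x), d) == eval(newOp, BigDecimal(x), BigDecimal(threshold)))
    }
    println("all rewrites agree with the original comparisons")
  }
}
{code}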




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12904) Strength reduction for integer/decimal comparisons

2016-01-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106436#comment-15106436
 ] 

Reynold Xin commented on SPARK-12904:
-

[~viirya] maybe you can add this when you have a chance. Thanks!


> Strength reduction for integer/decimal comparisons
> --
>
> Key: SPARK-12904
> URL: https://issues.apache.org/jira/browse/SPARK-12904
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>
> We can do the following strength reduction for comparisons between an 
> integral column and a decimal literal:
> 1. int_col > decimal_literal => int_col > floor(decimal_literal)
> 2. int_col >= decimal_literal => int_col > ceil(decimal_literal)
> 3. int_col < decimal_literal => int_col < floor(decimal_literal)
> 4. int_col <= decimal_literal => int_col < ceil(decimal_literal)
> This is more useful once we start parsing floating-point numeric
> literals as decimals rather than doubles (SPARK-12848).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12905) PCAModel return eigenvalues for PySpark

2016-01-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12905:

Priority: Minor  (was: Trivial)

> PCAModel return eigenvalues for PySpark
> ---
>
> Key: SPARK-12905
> URL: https://issues.apache.org/jira/browse/SPARK-12905
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> PCAModel return eigenvalues for PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12905) PCAModel return eigenvalues for PySpark

2016-01-19 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-12905:

Priority: Trivial  (was: Minor)

> PCAModel return eigenvalues for PySpark
> ---
>
> Key: SPARK-12905
> URL: https://issues.apache.org/jira/browse/SPARK-12905
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Trivial
>
> PCAModel return eigenvalues for PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12905) PCAModel return eigenvalues for PySpark

2016-01-19 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12905:
---

 Summary: PCAModel return eigenvalues for PySpark
 Key: SPARK-12905
 URL: https://issues.apache.org/jira/browse/SPARK-12905
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


PCAModel return eigenvalues for PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12905) PCAModel return eigenvalues for PySpark

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12905:


Assignee: (was: Apache Spark)

> PCAModel return eigenvalues for PySpark
> ---
>
> Key: SPARK-12905
> URL: https://issues.apache.org/jira/browse/SPARK-12905
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> PCAModel return eigenvalues for PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12905) PCAModel return eigenvalues for PySpark

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106444#comment-15106444
 ] 

Apache Spark commented on SPARK-12905:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10830

> PCAModel return eigenvalues for PySpark
> ---
>
> Key: SPARK-12905
> URL: https://issues.apache.org/jira/browse/SPARK-12905
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> PCAModel return eigenvalues for PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12905) PCAModel return eigenvalues for PySpark

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12905:


Assignee: Apache Spark

> PCAModel return eigenvalues for PySpark
> ---
>
> Key: SPARK-12905
> URL: https://issues.apache.org/jira/browse/SPARK-12905
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> PCAModel return eigenvalues for PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12897) spark ui with tachyon show all Stream Blocks

2016-01-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106458#comment-15106458
 ] 

Sean Owen commented on SPARK-12897:
---

It's not clear what you're describing. Can you say more? I am also not sure 
what the future of Tachyon is in Spark itself; it may live outside the project.

> spark ui with tachyon show all Stream Blocks
> 
>
> Key: SPARK-12897
> URL: https://issues.apache.org/jira/browse/SPARK-12897
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Web UI
>Affects Versions: 1.6.0
> Environment: "spark.externalBlockStore.url", "tachyon://l-xxx:19998"
> "spark.externalBlockStore.blockManager", 
> "org.apache.spark.storage.TachyonBlockManager"
> StorageLevel.OFF_HEAP
>Reporter: astralidea
>
> When I use Tachyon and open the Storage page of the Spark application UI,
> the job has already been running for 24 hours,
> but the number of Stream Blocks keeps growing, and the page loads and shows
> all blocks, even blocks that are never used, so it loads more and more slowly.
> I think Tachyon should behave like the traditional storage path: if a block is never
> used, its size is 0.0 B and it does not need to be shown on the Storage page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12876) Race condition when driver rapidly shutdown after started.

2016-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12876.
---
Resolution: Duplicate

> Race condition when driver rapidly shutdown after started.
> --
>
> Key: SPARK-12876
> URL: https://issues.apache.org/jira/browse/SPARK-12876
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: jeffonia Tung
>Priority: Minor
>
> It is similar to SPARK-4300, but this time it occasionally happens on
> the driver.
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Asked to launch driver 
> driver-20160118171237-0009
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying user jar 
> file:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
>  to /data/dbcenter/cdh5/spark-1.4.0-bin-hado
> op2.4/work/driver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Copying 
> /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/mylib/spark-ly-streaming-v2-201601141018.jar
>  to /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/dri
> ver-20160118171237-0009/spark-ly-streaming-v2-201601141018.jar
> [INFO 2016-01-18 17:12:35 (Logging.scala:59)] Launch Command: 
> "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" 
> ."org.apache.spark.deploy.worker.DriverWrapper"..
> [INFO 2016-01-18 17:12:39 (Logging.scala:59)] Asked to launch executor 
> app-20160118171240-0256/15 for DirectKafkaStreamingV2
> [INFO 2016-01-18 17:12:39 (Logging.scala:59)] Launch command: 
> "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp"  
> ."org.apache.spark.executor.CoarseGrainedExecutorBackend"..
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill driver 
> driver-20160118164724-0008
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Redirection to 
> /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/driver-20160118164724-0008/stdout
>  closed: Stream closed
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Asked to kill executor 
> app-20160118164728-0250/15
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Runner thread for executor 
> app-20160118164728-0250/15 interrupted
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Killing process!
> [ERROR 2016-01-18 17:12:49 (Logging.scala:96)] Error writing stream to file 
> /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/work/app-20160118164728-0250/15/stdout
> java.io.IOException: Stream closed
> at 
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> at java.io.FilterInputStream.read(FilterInputStream.java:107)
> at 
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
> at 
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
> [INFO 2016-01-18 17:12:49 (Logging.scala:59)] Executor 
> app-20160118164728-0250/15 finished with state KILLED exitStatus 143



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12848) Parse number as decimal rather than doubles

2016-01-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12848:

Summary: Parse number as decimal rather than doubles  (was: Parse number as 
decimal)

> Parse number as decimal rather than doubles
> ---
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser parses 1.23 as a double; when such a literal is used with
> decimal columns, the decimal is turned into a double and precision is lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will still be
> converted into a double when used together with doubles.
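
As a side note, the precision concern is easy to see in plain Scala, independent of Spark. This is only a generic illustration of double versus decimal arithmetic, not a description of the parser change itself:

{code}
// Generic illustration of why decimal literals preserve precision and doubles may not.
object DecimalVsDouble {
  def main(args: Array[String]): Unit = {
    val asDouble  = 0.1 + 0.2                              // binary floating point
    val asDecimal = BigDecimal("0.1") + BigDecimal("0.2")  // exact decimal arithmetic

    println(asDouble)                        // 0.30000000000000004
    println(asDouble == 0.3)                 // false
    println(asDecimal == BigDecimal("0.3"))  // true
  }
}
{code}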



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12848) Parse numbers as decimals rather than doubles

2016-01-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12848:

Summary: Parse numbers as decimals rather than doubles  (was: Parse number 
as decimal rather than doubles)

> Parse numbers as decimals rather than doubles
> -
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser parses 1.23 as a double; when such a literal is used with
> decimal columns, the decimal is turned into a double and precision is lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will still be
> converted into a double when used together with doubles.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12904) Strength reduction for integer/decimal comparisons

2016-01-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12904:

Description: 
We can do the following strength reduction for comparisons between an integral 
column and a decimal literal:

1. int_col > decimal_literal => int_col > floor(decimal_literal)

2. int_col >= decimal_literal => int_col > ceil(decimal_literal)

3. int_col < decimal_literal => int_col < floor(decimal_literal)

4. int_col <= decimal_literal => int_col < ceil(decimal_literal)

This is more useful once we start parsing floating-point numeric literals 
as decimals rather than doubles (SPARK-12848).



  was:
We can do the following strength reduction for comparisons between an integral 
column and a decimal literal:

1. int_col > decimal_literal => int_col > floor(decimal_literal)

2. int_col >= decimal_literal => int_col > ceil(decimal_literal)

3. int_col < decimal_literal => int_col < floor(decimal_literal)

4. int_col <= decimal_literal => int_col < ceil(decimal_literal)


This is useful more as soon as we start parsing floating point numeric literals 
as decimals rather than doubles (SPARK-12848).



> Strength reduction for integer/decimal comparisons
> --
>
> Key: SPARK-12904
> URL: https://issues.apache.org/jira/browse/SPARK-12904
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>
> We can do the following strength reduction for comparisons between an 
> integral column and a decimal literal:
> 1. int_col > decimal_literal => int_col > floor(decimal_literal)
> 2. int_col >= decimal_literal => int_col > ceil(decimal_literal)
> 3. int_col < decimal_literal => int_col < floor(decimal_literal)
> 4. int_col <= decimal_literal => int_col < ceil(decimal_literal)
> This is more useful once we start parsing floating-point numeric
> literals as decimals rather than doubles (SPARK-12848).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12892) Support plugging in Spark scheduler

2016-01-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106471#comment-15106471
 ] 

Sean Owen commented on SPARK-12892:
---

Do you intend some of the same functionality described in 
https://issues.apache.org/jira/browse/SPARK-3561? That was rejected a while 
ago.

> Support plugging in Spark scheduler 
> 
>
> Key: SPARK-12892
> URL: https://issues.apache.org/jira/browse/SPARK-12892
> Project: Spark
>  Issue Type: Improvement
>Reporter: Timothy Chen
>
> Currently the only supported cluster schedulers are standalone, Mesos, YARN,
> and SIMR. If users want to build a new one, it has to be merged back into the
> main code base, which may not be desirable for Spark and makes the scheduler hard
> to iterate on.
> Instead, we should make a plugin architecture possible, so that users who want to
> integrate a new scheduler can plug it in via configuration and runtime loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106548#comment-15106548
 ] 

Sasi commented on SPARK-12906:
--

Hi,
I'm running an endless script that queries my table using Spark SQL.
I took a dump with the jmap tool before and after the test.
I also ran a GC on the environment, and the retained size of LongSQLMetricValue 
did not shrink; it only increased.

My suspicion is the count method, because I'm running 
dataFrame.distinct().count().

Sasi

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12903) Add covar_samp and covar_pop for SparkR

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12903:


Assignee: Apache Spark

> Add covar_samp and covar_pop for SparkR
> ---
>
> Key: SPARK-12903
> URL: https://issues.apache.org/jira/browse/SPARK-12903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Add covar_samp and covar_pop for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12903) Add covar_samp and covar_pop for SparkR

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106422#comment-15106422
 ] 

Apache Spark commented on SPARK-12903:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10829

> Add covar_samp and covar_pop for SparkR
> ---
>
> Key: SPARK-12903
> URL: https://issues.apache.org/jira/browse/SPARK-12903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Add covar_samp and covar_pop for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12903) Add covar_samp and covar_pop for SparkR

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12903:


Assignee: (was: Apache Spark)

> Add covar_samp and covar_pop for SparkR
> ---
>
> Key: SPARK-12903
> URL: https://issues.apache.org/jira/browse/SPARK-12903
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>
> Add covar_samp and covar_pop for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12903) Add covar_samp and covar_pop for SparkR

2016-01-19 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-12903:
---

 Summary: Add covar_samp and covar_pop for SparkR
 Key: SPARK-12903
 URL: https://issues.apache.org/jira/browse/SPARK-12903
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Yanbo Liang


Add covar_samp and covar_pop for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12904) Strength reduction for integer/decimal comparisons

2016-01-19 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106491#comment-15106491
 ] 

Liang-Chi Hsieh commented on SPARK-12904:
-

Yeah, I would like to do it. Thanks!

> Strength reduction for integer/decimal comparisons
> --
>
> Key: SPARK-12904
> URL: https://issues.apache.org/jira/browse/SPARK-12904
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>
> We can do the following strength reduction for comparisons between an 
> integral column and a decimal literal:
> 1. int_col > decimal_literal => int_col > floor(decimal_literal)
> 2. int_col >= decimal_literal => int_col > ceil(decimal_literal)
> 3. int_col < decimal_literal => int_col < floor(decimal_literal)
> 4. int_col <= decimal_literal => int_col < ceil(decimal_literal)
> This is more useful once we start parsing floating-point numeric
> literals as decimals rather than doubles (SPARK-12848).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-7683:


Assignee: Sean Owen

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.
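
For reference, the Scala side already uses the accumulator-first convention. Below is a minimal plain-Scala sketch of that shape, using foldLeft on a local collection rather than RDD.fold/aggregate so it runs without a SparkContext:

{code}
// Accumulator-first folding, sketched with a local Scala collection.
object FoldAccFirst {
  def main(args: Array[String]): Unit = {
    val data = Seq(1, 2, 3, 4)
    // The zero value changes the result type: accumulate into a (sum, count) pair.
    val (sum, count) = data.foldLeft((0, 0)) { case ((accSum, accCount), x) =>
      (accSum + x, accCount + 1) // accumulator comes first, element second
    }
    println(s"sum=$sum count=$count") // sum=10 count=4
  }
}
{code}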



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7683) Confusing behavior of fold function of RDD in pyspark

2016-01-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7683.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10771
[https://github.com/apache/spark/pull/10771]

> Confusing behavior of fold function of RDD in pyspark
> -
>
> Key: SPARK-7683
> URL: https://issues.apache.org/jira/browse/SPARK-7683
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.3.1
>Reporter: Ai He
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> This will make the “fold” function consistent with the "fold" in rdd.scala 
> and other "aggregate" functions where “acc” goes first. Otherwise, users have 
> to write a lambda function like “lambda x, y: op(y, x)” if they want to use 
> “zeroValue” to change the result type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)
Sasi created SPARK-12906:


 Summary: LongSQLMetricValue cause memory leak on Spark 1.5.1
 Key: SPARK-12906
 URL: https://issues.apache.org/jira/browse/SPARK-12906
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Sasi


Hi,
I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
Now, after taking another heap dump, I notice after 2 hours that 
LongSQLMetricValue is causing a memory leak.

I didn't see any bug report or documentation about it.

Thanks,
Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasi updated SPARK-12906:
-
Attachment: screenshot-1.png

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106498#comment-15106498
 ] 

Sean Owen commented on SPARK-12906:
---

Please provide more detail. What leads you to think there's a leak, and where 
do you suspect the leak is?

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9976) create function do not work

2016-01-19 Thread ocean (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106517#comment-15106517
 ] 

ocean commented on SPARK-9976:
--

Regarding the second problem, I just found that it is only that the function cannot be 
described; it can still be used. Just a small problem.

> create function do not work
> ---
>
> Key: SPARK-9976
> URL: https://issues.apache.org/jira/browse/SPARK-9976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
> Environment: spark 1.4.1 yarn 2.2.0
>Reporter: cen yuhai
>
> I use beeline to connect to the ThriftServer, but ADD JAR does not work, so I use 
> CREATE FUNCTION; see the link below.
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html
> I do as below:
> {code}
> create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
> 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 
> {code}
> It returns OK, and when I connect to the metastore I see records in the FUNCS table.
> {code}
> select gdecodeorder(t1)  from tableX  limit 1;
> {code}
> It returns the error 'Couldn't find function default.gdecodeorder'.
> This is the exception:
> {code}
> 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
> as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
> java.lang.RuntimeException: Couldn't find function default.gdecodeorder
> 15/08/14 15:04:47 ERROR RetryingHMSHandler: 
> MetaException(message:NoSuchObjectException(message:Function 
> default.t_gdecodeorder does not exist))
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
> at com.sun.proxy.$Proxy21.get_function(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
> at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
> at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
> at 
> 

[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106596#comment-15106596
 ] 

Sasi commented on SPARK-12906:
--

added dumps



> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasi updated SPARK-12906:
-
Attachment: dump1.PNG

After GC.

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: dump1.PNG, screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sasi updated SPARK-12906:
-
Comment: was deleted

(was: added dumps

)

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: dump1.PNG, screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106599#comment-15106599
 ] 

Sasi edited comment on SPARK-12906 at 1/19/16 11:14 AM:


added dump After GC.

Code:

while (true) {
subscribersDataFrame.distinct().count()
}


was (Author: sasi2103):
added dump After GC.

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: dump1.PNG, screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15106599#comment-15106599
 ] 

Sasi edited comment on SPARK-12906 at 1/19/16 11:12 AM:


added dump After GC.


was (Author: sasi2103):
After GC.

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: dump1.PNG, screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump, I notice after 2 hours that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12560) SqlTestUtils.stripSparkFilter needs to copy utf8strings

2016-01-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12560.

Resolution: Fixed
  Assignee: Imran Rashid

Resolved by https://github.com/apache/spark/pull/10510

> SqlTestUtils.stripSparkFilter needs to copy utf8strings
> ---
>
> Key: SPARK-12560
> URL: https://issues.apache.org/jira/browse/SPARK-12560
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
>
> {{SqlTestUtils.stripSparkFilter}} needs to make copies of the UTF8Strings, 
> e.g., with {{FromUnsafeProjection}}, to avoid returning duplicates of the same 
> row (see SPARK-9459).
> Right now, this isn't causing any problems, since the parquet string 
> predicate pushdown is turned off (see SPARK-11153). However, I ran into this 
> while trying to get the predicate pushdown to work with a different version 
> of parquet. Without this fix, there were errors like:
> {noformat}
> [info]   !== Correct Answer - 4 ==   == Spark Answer - 4 ==
> [info]   ![1][2]
> [info][2][2]
> [info]   ![3][4]
> [info][4][4] (QueryTest.scala:127)
> {noformat}
> I figure it's worth making this change now, since I ran into it. A PR is coming 
> shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12910:


Assignee: (was: Apache Spark)

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: installation, sparkR
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> ```
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12560) SqlTestUtils.stripSparkFilter needs to copy utf8strings

2016-01-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12560:
---
Fix Version/s: 2.0.0

> SqlTestUtils.stripSparkFilter needs to copy utf8strings
> ---
>
> Key: SPARK-12560
> URL: https://issues.apache.org/jira/browse/SPARK-12560
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{SqlTestUtils.stripSparkFilter}} needs to make copies of the UTF8Strings, 
> e.g., with {{FromUnsafeProjection}}, to avoid returning duplicates of the same 
> row (see SPARK-9459).
> Right now, this isn't causing any problems, since the parquet string 
> predicate pushdown is turned off (see SPARK-11153). However, I ran into this 
> while trying to get the predicate pushdown to work with a different version 
> of parquet. Without this fix, there were errors like:
> {noformat}
> [info]   !== Correct Answer - 4 ==   == Spark Answer - 4 ==
> [info]   ![1][2]
> [info][2][2]
> [info]   ![3][4]
> [info][4][4] (QueryTest.scala:127)
> {noformat}
> I figure it's worth making this change now, since I ran into it. A PR is coming 
> shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12910:


Assignee: Apache Spark

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Assignee: Apache Spark
>Priority: Minor
>  Labels: installation, sparkR
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> ```
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-19 Thread Shubhanshu Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107370#comment-15107370
 ] 

Shubhanshu Mishra commented on SPARK-12910:
---

I have created a pull request at https://github.com/apache/spark/pull/10836

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: installation, sparkR
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> ```
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107534#comment-15107534
 ] 

Apache Spark commented on SPARK-12869:
--

User 'Fokko' has created a pull request for this issue:
https://github.com/apache/spark/pull/10839

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Fokko Driesprong
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.
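
For context, here is a small sketch of the conversion path being discussed, using MLlib's public distributed-matrix API (a local[*] SparkContext is assumed purely for the example). The entry count printed below hints at why routing a dense matrix through CoordinateMatrix is expensive: every non-zero element becomes its own MatrixEntry record.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

object BlockToIndexedRow {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("block-to-indexedrow").setMaster("local[*]"))

    // A single dense 3x3 block; every cell is populated.
    val blocks = sc.parallelize(Seq(((0, 0), Matrices.dense(3, 3, Array.fill(9)(1.0)))))
    val blockMatrix = new BlockMatrix(blocks, 3, 3)

    // Current path: BlockMatrix -> CoordinateMatrix -> IndexedRowMatrix.
    val coord = blockMatrix.toCoordinateMatrix()
    println(s"MatrixEntry records: ${coord.entries.count()}") // 9 for this all-ones block
    val indexedRows = coord.toIndexedRowMatrix()
    println(s"rows=${indexedRows.numRows()} cols=${indexedRows.numCols()}")

    sc.stop()
  }
}
{code}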



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12869:


Assignee: (was: Apache Spark)

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Fokko Driesprong
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-01-19 Thread Fokko Driesprong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107533#comment-15107533
 ] 

Fokko Driesprong commented on SPARK-12869:
--

Hi guys,

I've implemented an improved version of the toIndexedRowMatrix function on the 
BlockMatrix. I needed this for a project, but would like to share it with the 
rest of the community. In the case of dense matrices, it can increase 
performance up to 19 times:
https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

The pull-request on Github:
https://github.com/apache/spark/pull/10839

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Fokko Driesprong
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12869) Optimize conversion from BlockMatrix to IndexedRowMatrix

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12869:


Assignee: Apache Spark

> Optimize conversion from BlockMatrix to IndexedRowMatrix
> 
>
> Key: SPARK-12869
> URL: https://issues.apache.org/jira/browse/SPARK-12869
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Fokko Driesprong
>Assignee: Apache Spark
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In the current implementation of the BlockMatrix, the conversion to the 
> IndexedRowMatrix is done by converting it to a CoordinateMatrix first. This 
> is somewhat ok when the matrix is very sparse, but for dense matrices this is 
> very inefficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12479:


Assignee: (was: Apache Spark)

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when run in sparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the 
> bitwise shift below returns NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .
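
For comparison, assuming the SparkR helper is meant to mirror the Java-style 31*h + c string hash, here is a minimal Scala sketch showing why the JVM side never hits this: 32-bit Int arithmetic simply wraps on overflow instead of producing NA.

{code}
// Illustration only: Int arithmetic wraps on overflow, so the equivalent
// computation always yields a well-defined 32-bit value.
def javaStyleHash(s: String): Int =
  s.foldLeft(0)((h, c) => 31 * h + c.toInt)

// e.g. javaStyleHash("bc53d3605e8a5b7de1e8e271c2317645") overflows and wraps,
// but never becomes "missing".
{code}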



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12479) sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed"

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12479:


Assignee: Apache Spark

>  sparkR collect on GroupedData  throws R error "missing value where 
> TRUE/FALSE needed"
> --
>
> Key: SPARK-12479
> URL: https://issues.apache.org/jira/browse/SPARK-12479
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Paulo Magalhaes
>Assignee: Apache Spark
>
> sparkR collect on GroupedData  throws "missing value where TRUE/FALSE needed"
> Spark Version: 1.5.1
> R Version: 3.2.2
> I tracked down the root cause of this exception to a specific key for which 
> the hashCode could not be calculated.
> The following code recreates the problem when run in SparkR:
> hashCode <- getFromNamespace("hashCode","SparkR")
> hashCode("bc53d3605e8a5b7de1e8e271c2317645")
> Error in if (value > .Machine$integer.max) { :
>   missing value where TRUE/FALSE needed
> I went one step further and realised that the problem happens because the 
> bitwise shift below returns NA.
> bitwShiftL(-1073741824,1)
> where bitwShiftL is an R function. 
> I believe the bitwShiftL function is working as it is supposed to. Therefore, 
> this PR fixes it in the SparkR package: 
> https://github.com/apache/spark/pull/10436
> .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests

2016-01-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11295:
--
Assignee: Gabor Liptak

> Add packages to JUnit output for Python tests
> -
>
> Key: SPARK-11295
> URL: https://issues.apache.org/jira/browse/SPARK-11295
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Reporter: Gabor Liptak
>Assignee: Gabor Liptak
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests

2016-01-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11295:
--
Target Version/s: 2.0.0

> Add packages to JUnit output for Python tests
> -
>
> Key: SPARK-11295
> URL: https://issues.apache.org/jira/browse/SPARK-11295
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Reporter: Gabor Liptak
>Assignee: Gabor Liptak
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11295) Add packages to JUnit output for Python tests

2016-01-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11295:
--
Component/s: PySpark

> Add packages to JUnit output for Python tests
> -
>
> Key: SPARK-11295
> URL: https://issues.apache.org/jira/browse/SPARK-11295
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Tests
>Reporter: Gabor Liptak
>Assignee: Gabor Liptak
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107526#comment-15107526
 ] 

Apache Spark commented on SPARK-6166:
-

User 'redsanket' has created a pull request for this issue:
https://github.com/apache/spark/pull/10838

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Assignee: Shixiong Zhu
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of 
> size.
> But this is not always sufficient: as the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests it has 
> significantly reduced the occurrence of worker failures.
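
For illustration, this is how such a bound would sit next to the existing one in user code (spark.reducer.maxReqsInFlight is only the proposed name, and the values are arbitrary examples; the size-based key is the one named in the description):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.reducer.maxMbInFlight", "48")     // existing bound on in-flight data size
  .set("spark.reducer.maxReqsInFlight", "64")   // proposed bound on outstanding fetch requests
{code}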



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables

2016-01-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107603#comment-15107603
 ] 

Xiao Li commented on SPARK-12845:
-

Let me know if you hit any bug. Thanks!

> During join Spark should pushdown predicates on joining column to both tables
> -
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query instead:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should exhibit the same behaviour for both queries.
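
A minimal DataFrame-API sketch of the same check (hypothetical table and column names taken from the report); comparing the plans shows whether the filter on the join key is inferred for both sides:

{code}
import org.apache.spark.sql.DataFrame

// t1 is assumed to have a column id1, t2 a column id2.
def checkPushdown(t1: DataFrame, t2: DataFrame): Unit = {
  val joined = t1.join(t2, t1("id1") === t2("id2")).where(t1("id1") === 1234)
  joined.explain(true)  // inspect whether a Filter on t2.id2 appears in the optimized plan
}
{code}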



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107369#comment-15107369
 ] 

Apache Spark commented on SPARK-12910:
--

User 'napsternxg' has created a pull request for this issue:
https://github.com/apache/spark/pull/10836

> Support for specifying version of R to use while creating sparkR libraries
> --
>
> Key: SPARK-12910
> URL: https://issues.apache.org/jira/browse/SPARK-12910
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
> Environment: Linux
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: installation, sparkR
>
> When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
> However, a user might have locally installed their own version of R. There 
> should be a way to specify which R version to use. 
> I have fixed this in my code using the following patch:
> ```
> $ git diff HEAD
> diff --git a/R/README.md b/R/README.md
> index 005f56d..99182e5 100644
> --- a/R/README.md
> +++ b/R/README.md
> @@ -1,6 +1,15 @@
>  # R on Spark
>  
>  SparkR is an R package that provides a light-weight frontend to use Spark 
> from R.
> +### Installing sparkR
> +
> +Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
> done by running the script `$SPARK_HOME/R/install-dev.sh`.
> +By default the above script uses the system wide installation of R. However, 
> this can be changed to any user installed location of R by giving the full 
> path of the `$R_HOME` as the first argument to the install-dev.sh script.
> +Example: 
> +```
> +# where /home/username/R is where R is installed and /home/username/R/bin 
> contains the files R and RScript
> +./install-dev.sh /home/username/R 
> +```
>  
>  ### SparkR development
>  
> diff --git a/R/install-dev.sh b/R/install-dev.sh
> index 4972bb9..a8efa86 100755
> --- a/R/install-dev.sh
> +++ b/R/install-dev.sh
> @@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
>  mkdir -p $LIB_DIR
>  
>  pushd $FWDIR > /dev/null
> +if [ ! -z "$1" ]
> +  then
> +R_HOME="$1/bin"
> +   else
> +R_HOME="$(dirname $(which R))"
> +fi
> +echo "USING R_HOME = $R_HOME"
>  
>  # Generate Rd files if devtools is installed
> -Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> +"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
>  
>  # Install SparkR to $LIB_DIR
> -R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
> +"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
>  
>  # Zip the SparkR package so that it can be distributed to worker nodes on 
> YARN
>  cd $LIB_DIR
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10777:


Assignee: Apache Spark

> order by fails when column is aliased and projection includes windowed 
> aggregate
> 
>
> Key: SPARK-10777
> URL: https://issues.apache.org/jira/browse/SPARK-10777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Assignee: Apache Spark
>
> This statement fails in SPARK (works fine in ORACLE, DB2 )
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
> columns c1, c2; line 3 pos 9
> SQLState:  null
> ErrorCode: 0
> Forcing the aliased column name works around the defect
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> These work fine
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> select r as c1, s  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> create table  if not exists TINT ( RNUM int , CINT int   )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS ORC  ;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10777:


Assignee: (was: Apache Spark)

> order by fails when column is aliased and projection includes windowed 
> aggregate
> 
>
> Key: SPARK-10777
> URL: https://issues.apache.org/jira/browse/SPARK-10777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> This statement fails in SPARK (works fine in ORACLE, DB2 )
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
> columns c1, c2; line 3 pos 9
> SQLState:  null
> ErrorCode: 0
> Forcing the aliased column name works around the defect
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> These work fine
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> select r as c1, s  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> create table  if not exists TINT ( RNUM int , CINT int   )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS ORC  ;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10777) order by fails when column is aliased and projection includes windowed aggregate

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107434#comment-15107434
 ] 

Apache Spark commented on SPARK-10777:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/10678

> order by fails when column is aliased and projection includes windowed 
> aggregate
> 
>
> Key: SPARK-10777
> URL: https://issues.apache.org/jira/browse/SPARK-10777
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> This statement fails in SPARK (works fine in ORACLE, DB2 )
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> Error: org.apache.spark.sql.AnalysisException: cannot resolve 'r' given input 
> columns c1, c2; line 3 pos 9
> SQLState:  null
> ErrorCode: 0
> Forcing the aliased column name works around the defect
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> These work fine
> select r as c1, min ( s ) over ()  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by c1
> select r as c1, s  as c2 from
>   ( select rnum r, sum ( cint ) s from certstring.tint group by rnum ) t
> order by r
> create table  if not exists TINT ( RNUM int , CINT int   )
>  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' 
>  STORED AS ORC  ;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9716) BinaryClassificationEvaluator should accept Double prediction column

2016-01-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9716.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10472
[https://github.com/apache/spark/pull/10472]

> BinaryClassificationEvaluator should accept Double prediction column
> 
>
> Key: SPARK-9716
> URL: https://issues.apache.org/jira/browse/SPARK-9716
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Benjamin Fradet
>Priority: Minor
> Fix For: 2.0.0
>
>
> BinaryClassificationEvaluator currently expects the rawPrediction column, of 
> type Vector.  It should also accept a Double prediction column, with a 
> different set of supported metrics.
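
A minimal usage sketch of what this enables (my illustration, assuming a scored DataFrame with a Double "prediction" column and a "label" column):

{code}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.sql.DataFrame

// `scored` is assumed to carry a Double "prediction" column and a "label" column.
def areaUnderROC(scored: DataFrame): Double = {
  new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("prediction")  // after this change a Double column is accepted here
    .setMetricName("areaUnderROC")
    .evaluate(scored)
}
{code}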



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-01-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-6166:
-
Assignee: (was: Shixiong Zhu)

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in flight data in terms of 
> size.
> But this is not always sufficient: as the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests it has 
> significantly reduced the occurrence of worker failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12912) Add test suite for EliminateSubQueries

2016-01-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12912:
---

 Summary: Add test suite for EliminateSubQueries
 Key: SPARK-12912
 URL: https://issues.apache.org/jira/browse/SPARK-12912
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12912) Add test suite for EliminateSubQueries

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107606#comment-15107606
 ] 

Apache Spark commented on SPARK-12912:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10837

> Add test suite for EliminateSubQueries
> --
>
> Key: SPARK-12912
> URL: https://issues.apache.org/jira/browse/SPARK-12912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12912) Add test suite for EliminateSubQueries

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12912:


Assignee: Reynold Xin  (was: Apache Spark)

> Add test suite for EliminateSubQueries
> --
>
> Key: SPARK-12912
> URL: https://issues.apache.org/jira/browse/SPARK-12912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12912) Add test suite for EliminateSubQueries

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12912:


Assignee: Apache Spark  (was: Reynold Xin)

> Add test suite for EliminateSubQueries
> --
>
> Key: SPARK-12912
> URL: https://issues.apache.org/jira/browse/SPARK-12912
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2750) Add Https support for Web UI

2016-01-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-2750.
---
   Resolution: Fixed
 Assignee: Fei Wang
Fix Version/s: 2.0.0

> Add Https support for Web UI
> 
>
> Key: SPARK-2750
> URL: https://issues.apache.org/jira/browse/SPARK-2750
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Tao Wang
>Assignee: Fei Wang
>  Labels: https, ssl, webui
> Fix For: 2.0.0
>
> Attachments: exception on yarn when https enabled.txt
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now I am trying to add https support for the web UI using Jetty SSL integration. 
> Below is the plan:
> 1. Web UI covers the Master UI, Worker UI, HistoryServer UI and Spark UI. Users 
> can switch between https and http by configuring "spark.http.policy" as a JVM 
> property for each process, with http as the default.
> 2. The web ports of Master and Worker are decided in order of launch 
> arguments, JVM property, system environment and default port.
> 3. Below are some other configuration items:
> {code}
> spark.ssl.server.keystore.location The file or URL of the SSL Key store
> spark.ssl.server.keystore.password  The password for the key store
> spark.ssl.server.keystore.keypassword The password (if any) for the specific 
> key within the key store
> spark.ssl.server.keystore.type The type of the key store (default "JKS")
> spark.client.https.need-auth True if SSL needs client authentication
> spark.ssl.server.truststore.location The file name or URL of the trust store 
> location
> spark.ssl.server.truststore.password The password for the trust store
> spark.ssl.server.truststore.type The type of the trust store (default "JKS")
> {code}
> Any feedback is welcome!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11295) Add packages to JUnit output for Python tests

2016-01-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11295.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9263
[https://github.com/apache/spark/pull/9263]

> Add packages to JUnit output for Python tests
> -
>
> Key: SPARK-11295
> URL: https://issues.apache.org/jira/browse/SPARK-11295
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: Gabor Liptak
>Priority: Minor
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12845) During join Spark should pushdown predicates on joining column to both tables

2016-01-19 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107600#comment-15107600
 ] 

Xiao Li commented on SPARK-12845:
-

I think the following PR resolves your issue: 
https://github.com/apache/spark/pull/10490
Right?

> During join Spark should pushdown predicates on joining column to both tables
> -
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query instead:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should exhibit the same behaviour for both queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12910) Support for specifying version of R to use while creating sparkR libraries

2016-01-19 Thread Shubhanshu Mishra (JIRA)
Shubhanshu Mishra created SPARK-12910:
-

 Summary: Support for specifying version of R to use while creating 
sparkR libraries
 Key: SPARK-12910
 URL: https://issues.apache.org/jira/browse/SPARK-12910
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
 Environment: Linux
Reporter: Shubhanshu Mishra
Priority: Minor


When we use `$SPARK_HOME/R/install-dev.sh` it uses the default system R. 
However, a user might have locally installed their own version of R. There 
should be a way to specify which R version to use. 


I have fixed this in my code using the following patch:


```
$ git diff HEAD
diff --git a/R/README.md b/R/README.md
index 005f56d..99182e5 100644
--- a/R/README.md
+++ b/R/README.md
@@ -1,6 +1,15 @@
 # R on Spark
 
 SparkR is an R package that provides a light-weight frontend to use Spark from 
R.
+### Installing sparkR
+
+Libraries of sparkR need to be created in `$SPARK_HOME/R/lib`. This can be 
done by running the script `$SPARK_HOME/R/install-dev.sh`.
+By default the above script uses the system wide installation of R. However, 
this can be changed to any user installed location of R by giving the full path 
of the `$R_HOME` as the first argument to the install-dev.sh script.
+Example: 
+```
+# where /home/username/R is where R is installed and /home/username/R/bin 
contains the files R and RScript
+./install-dev.sh /home/username/R 
+```
 
 ### SparkR development
 
diff --git a/R/install-dev.sh b/R/install-dev.sh
index 4972bb9..a8efa86 100755
--- a/R/install-dev.sh
+++ b/R/install-dev.sh
@@ -35,12 +35,19 @@ LIB_DIR="$FWDIR/lib"
 mkdir -p $LIB_DIR
 
 pushd $FWDIR > /dev/null
+if [ ! -z "$1" ]
+  then
+R_HOME="$1/bin"
+   else
+R_HOME="$(dirname $(which R))"
+fi
+echo "USING R_HOME = $R_HOME"
 
 # Generate Rd files if devtools is installed
-Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
+"$R_HOME/"Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
 
 # Install SparkR to $LIB_DIR
-R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
+"$R_HOME/"R CMD INSTALL --library=$LIB_DIR $FWDIR/pkg/
 
 # Zip the SparkR package so that it can be distributed to worker nodes on YARN
 cd $LIB_DIR

```




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12816) Schema generation for type aliases does not work

2016-01-19 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12816.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10749
[https://github.com/apache/spark/pull/10749]

> Schema generation for type aliases does not work
> 
>
> Key: SPARK-12816
> URL: https://issues.apache.org/jira/browse/SPARK-12816
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Jakob Odersky
> Fix For: 2.0.0
>
>
> Related to the second part of SPARK-12777.
> Assume the following:
> {code}
> case class Container[A](a: A)
> type IntContainer = Container[Int]
> {code}
> Generating a schema with 
> {code}org.apache.spark.sql.catalyst.ScalaReflection.schemaFor[IntContainer]{code}
>  fails miserably with {{NoSuchElementException: : head of empty list  
> (ScalaReflection.scala:504)}} (the same exception as described in the related 
> issues)
> Since {{schemaFor}} is called whenever a schema is implicitly needed, 
> {{Datasets}} cannot be created from certain aliased types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12911) Caching a dataframe causes array comparisons to fail (in filter / where) after 1.6

2016-01-19 Thread Jesse English (JIRA)
Jesse English created SPARK-12911:
-

 Summary: Caching a dataframe causes array comparisons to fail (in 
filter / where) after 1.6
 Key: SPARK-12911
 URL: https://issues.apache.org/jira/browse/SPARK-12911
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0
 Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
Reporter: Jesse English


When doing a *where* operation on a dataframe and testing for equality on an 
array type, after 1.6 no valid comparisons are made if the dataframe has been 
cached.  If it has not been cached, the results are as expected.

This appears to be related to the underlying unsafe array data types.

{code:title=test.scala|borderStyle=solid}
test("test array comparison") {

val vectors: Vector[Row] =  Vector(
  Row.fromTuple("id_1" -> Array(0L, 2L)),
  Row.fromTuple("id_2" -> Array(0L, 5L)),
  Row.fromTuple("id_3" -> Array(0L, 9L)),
  Row.fromTuple("id_4" -> Array(1L, 0L)),
  Row.fromTuple("id_5" -> Array(1L, 8L)),
  Row.fromTuple("id_6" -> Array(2L, 4L)),
  Row.fromTuple("id_7" -> Array(5L, 6L)),
  Row.fromTuple("id_8" -> Array(6L, 2L)),
  Row.fromTuple("id_9" -> Array(7L, 0L))
)
val data: RDD[Row] = sc.parallelize(vectors, 3)

val schema = StructType(
  StructField("id", StringType, false) ::
StructField("point", DataTypes.createArrayType(LongType, false), false) 
::
Nil
)

val sqlContext = new SQLContext(sc)
val dataframe = sqlContext.createDataFrame(data, schema)

val targetPoint:Array[Long] = Array(0L,9L)

//Caching is the trigger that causes the error (no caching causes no error)
dataframe.cache()

//This is the line where it fails
//java.util.NoSuchElementException: next on empty iterator
//However we know that there is a valid match
val targetRow = dataframe.where(dataframe("point") === 
array(targetPoint.map(value => lit(value)): _*)).first()

assert(targetRow != null)
  }
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12790:
--
Assignee: Felix Cheung

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>Assignee: Felix Cheung
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107167#comment-15107167
 ] 

Andrew Or commented on SPARK-12790:
---

We should also update all the tests that rely on the old format. If you look under 
core/src/test/resources there are a bunch of them.

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12790) Remove HistoryServer old multiple files format

2016-01-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107169#comment-15107169
 ] 

Andrew Or commented on SPARK-12790:
---

I've assigned this to you.

> Remove HistoryServer old multiple files format
> --
>
> Key: SPARK-12790
> URL: https://issues.apache.org/jira/browse/SPARK-12790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Andrew Or
>Assignee: Felix Cheung
>
> HistoryServer has 2 formats. The old one makes a directory and puts multiple 
> files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 
> 1 file called local_2593759238651.log or something.
> It's been a nightmare to maintain both code paths. We should just remove the 
> old legacy format (which has been out of use for many versions now) when we 
> still have the chance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12907) Use BitSet to represent null fields in ColumnVector

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12907:


Assignee: (was: Apache Spark)

> Use BitSet to represent null fields in ColumnVector
> ---
>
> Key: SPARK-12907
> URL: https://issues.apache.org/jira/browse/SPARK-12907
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Use bit vectors (BitSet) to represent null fields information in ColumnVector 
> to reduce memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12907) Use BitSet to represent null fields in ColumnVector

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12907:


Assignee: Apache Spark

> Use BitSet to represent null fields in ColumnVector
> ---
>
> Key: SPARK-12907
> URL: https://issues.apache.org/jira/browse/SPARK-12907
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> Use bit vectors (BitSet) to represent null fields information in ColumnVector 
> to reduce memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12887) Do not expose var's in TaskMetrics

2016-01-19 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12887.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Do not expose var's in TaskMetrics
> --
>
> Key: SPARK-12887
> URL: https://issues.apache.org/jira/browse/SPARK-12887
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> TaskMetrics has a bunch of var's, some are fully public, some are 
> private[spark]. This is bad coding style that makes it easy to accidentally 
> overwrite previously set metrics. This has happened a few times in the past 
> and caused bugs that were difficult to debug.
> Instead, we should have get-or-create semantics, which are more readily 
> understandable. This makes sense in the case of TaskMetrics because these are 
> just aggregated metrics that we want to collect throughout the task, so it 
> doesn't matter *who*'s incrementing them.
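
A small sketch of the get-or-create shape being described (illustrative only, not the actual TaskMetrics code):

{code}
class InputMetricsStub { var bytesRead = 0L }

class TaskMetricsLike {
  private var _inputMetrics: Option[InputMetricsStub] = None

  // Callers always go through this accessor, so nobody can clobber a
  // previously registered metrics object by assigning to a public var.
  def registerInputMetrics(): InputMetricsStub = _inputMetrics.getOrElse {
    val m = new InputMetricsStub
    _inputMetrics = Some(m)
    m
  }
}
{code}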



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12907) Use BitSet to represent null fields in ColumnVector

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107200#comment-15107200
 ] 

Apache Spark commented on SPARK-12907:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10833

> Use BitSet to represent null fields in ColumnVector
> ---
>
> Key: SPARK-12907
> URL: https://issues.apache.org/jira/browse/SPARK-12907
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> Use bit vectors (BitSet) to represent null fields information in ColumnVector 
> to reduce memory footprint.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12804) ml.classification.LogisticRegression fails when FitIntercept with same-label dataset

2016-01-19 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-12804.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10743
[https://github.com/apache/spark/pull/10743]

> ml.classification.LogisticRegression fails when FitIntercept with same-label 
> dataset
> 
>
> Key: SPARK-12804
> URL: https://issues.apache.org/jira/browse/SPARK-12804
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Feynman Liang
>Assignee: Feynman Liang
> Fix For: 2.0.0
>
>
> When training LogisticRegression on a dataset where the label is all 0 or all 
> 1, an array out of bounds exception is thrown. The problematic code is
> {code}
> initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(histogram(1) / histogram(0))
> {code}
> The correct behaviour is to short-circuit training entirely when only a 
> single label is present (can be detected from {{labelSummarizer}}) and return 
> a classifier which assigns all true/false with infinite weights.
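
A sketch of the described short-circuit in terms of the label histogram (illustrative, not the actual patch):

{code}
// histogram(0) = count of label 0.0, histogram(1) = count of label 1.0
def constantLabelIntercept(histogram: Array[Double]): Option[Double] = {
  if (histogram(0) == 0.0) Some(Double.PositiveInfinity)        // every label is 1
  else if (histogram(1) == 0.0) Some(Double.NegativeInfinity)   // every label is 0
  else None                                                     // mixed labels: train as usual
}
{code}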



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12895) Implement TaskMetrics using accumulators

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12895:


Assignee: Apache Spark  (was: Andrew Or)

> Implement TaskMetrics using accumulators
> 
>
> Key: SPARK-12895
> URL: https://issues.apache.org/jira/browse/SPARK-12895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> We need to first do this before we can avoid sending TaskMetrics from the 
> executors to the driver. After we do this, we can send only accumulator 
> updates instead of both that AND TaskMetrics.
> By the end of this issue TaskMetrics will be a wrapper of accumulators. It 
> will be only syntactic sugar for setting these accumulators.
> But first, we need to express everything in TaskMetrics as accumulators.
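
A tiny sketch of the "wrapper over accumulators" idea (illustrative only, not the internal API):

{code}
import org.apache.spark.SparkContext

// A metric that is nothing more than sugar over a named Long accumulator.
class BytesReadMetric(sc: SparkContext) {
  private val acc = sc.accumulator(0L, "bytesRead")
  def add(n: Long): Unit = acc += n
  def value: Long = acc.value
}
{code}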



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12895) Implement TaskMetrics using accumulators

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12895:


Assignee: Andrew Or  (was: Apache Spark)

> Implement TaskMetrics using accumulators
> 
>
> Key: SPARK-12895
> URL: https://issues.apache.org/jira/browse/SPARK-12895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We need to first do this before we can avoid sending TaskMetrics from the 
> executors to the driver. After we do this, we can send only accumulator 
> updates instead of both that AND TaskMetrics.
> By the end of this issue TaskMetrics will be a wrapper of accumulators. It 
> will be only syntactic sugar for setting these accumulators.
> But first, we need to express everything in TaskMetrics as accumulators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans

2016-01-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11944.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10150
[https://github.com/apache/spark/pull/10150]

> Python API for mllib.clustering.BisectingKMeans
> ---
>
> Key: SPARK-11944
> URL: https://issues.apache.org/jira/browse/SPARK-11944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add Python API for mllib.clustering.BisectingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2016-01-19 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107151#comment-15107151
 ] 

Andrew Or commented on SPARK-12485:
---

[~srowen] to answer your question no I don't feel super strongly about changing 
it. Naming is difficult in general and I think both "dynamic allocation" and 
"elastic scaling" do mean roughly the same thing. It's just that I slightly 
prefer the latter (or something shorter) after giving a few talks on this topic 
and chatting with a few people about it in real life. I'm also totally cool 
with closing this as a Won't Fix if you or [~markhamstra] prefer.

> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12870) better format bucket id in file name

2016-01-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12870.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10799
[https://github.com/apache/spark/pull/10799]

> better format bucket id in file name
> 
>
> Key: SPARK-12870
> URL: https://issues.apache.org/jira/browse/SPARK-12870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-19 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107267#comment-15107267
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks Mario! 
bq. We should also have a python/pyspark/streaming/kafka-v09.py as well that 
matches to our external/kafka-v09
I agree, I will look into this.
bq. Why do you have the Broker.scala class? Unless i am missing something, it 
should be knocked off
Yeah, I noticed that too and I agree. This should be pretty simple to take out. 
I also 
[noticed|https://issues.apache.org/jira/browse/SPARK-12177?focusedCommentId=15089750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15089750]
 that the v09 example was picking up some Kafka v08 jars, so I am working on 
fixing that too.
bq. I think the package should be 'org.apache.spark.streaming.kafka' only in 
external/kafka-v09 and not 'org.apache.spark.streaming.kafka.v09'. This is 
because we produce a jar with a diff name (user picks which one and even if 
he/she mismatches, it errors correctly since the KafkaUtils method signatures 
are different)
I totally understand what you mean. However, kafka has its [own assembly in 
Spark|https://github.com/apache/spark/tree/master/external/kafka-assembly] and 
the way the code is structured right now, both the new API and old API would go 
in the same assembly so it's important to have a different package name. Also, 
I think for our end users transitioning from old to new API, I foresee them 
having 2 versions of their spark-kafka app. One that works with the old API and 
one with the new API. And, I think it would be an easier transition if they 
could include both the kafka API versions in the spark classpath and pick and 
choose which app to run without mucking with maven dependencies and 
re-compiling when they want to switch. Let me know if you disagree.

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 already released and it introduce new consumer API that not 
> compatible with old one. So, I added new consumer api. I made separate 
> classes in package org.apache.spark.streaming.kafka.v09 with changed API. I 
> didn't remove old classes for more backward compatibility. User will not need 
> to change his old spark applications when he uprgade to new Spark version.
> Please rewiew my changes



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12907) Use BitSet to represent null fields in ColumnVector

2016-01-19 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-12907:


 Summary: Use BitSet to represent null fields in ColumnVector
 Key: SPARK-12907
 URL: https://issues.apache.org/jira/browse/SPARK-12907
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Kazuaki Ishizaki
Priority: Minor


Use bit vectors (BitSet) to represent null fields information in ColumnVector 
to reduce memory footprint.
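
A minimal sketch of the idea (illustrative only, not the actual ColumnVector API): one bit per row for null tracking instead of a byte or boxed value per row.

{code}
class NullableDoubleColumn(capacity: Int) {
  private val nulls = new java.util.BitSet(capacity)   // 1 bit of null info per row
  private val values = new Array[Double](capacity)

  def putNull(row: Int): Unit = nulls.set(row)
  def putDouble(row: Int, v: Double): Unit = { nulls.clear(row); values(row) = v }
  def isNullAt(row: Int): Boolean = nulls.get(row)
  def getDouble(row: Int): Double = values(row)
}
{code}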



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1

2016-01-19 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107203#comment-15107203
 ] 

Josh Rosen commented on SPARK-12906:


Ping [~zsxwing], since I know you've looked into similar leaks in the past.

> LongSQLMetricValue cause memory leak on Spark 1.5.1
> ---
>
> Key: SPARK-12906
> URL: https://issues.apache.org/jira/browse/SPARK-12906
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sasi
> Attachments: dump1.PNG, screenshot-1.png
>
>
> Hi,
> I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that 
> scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak.
> Now, after taking another heap dump two hours in, I notice that 
> LongSQLMetricValue is causing a memory leak.
> I didn't see any existing bug report or documentation about it.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12895) Implement TaskMetrics using accumulators

2016-01-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107358#comment-15107358
 ] 

Apache Spark commented on SPARK-12895:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/10835

> Implement TaskMetrics using accumulators
> 
>
> Key: SPARK-12895
> URL: https://issues.apache.org/jira/browse/SPARK-12895
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> We need to first do this before we can avoid sending TaskMetrics from the 
> executors to the driver. After we do this, we can send only accumulator 
> updates instead of both that AND TaskMetrics.
> By the end of this issue TaskMetrics will be a wrapper of accumulators. It 
> will be only syntactic sugar for setting these accumulators.
> But first, we need to express everything in TaskMetrics as accumulators.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12870) better format bucket id in file name

2016-01-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12870:
-
Assignee: Wenchen Fan

> better format bucket id in file name
> 
>
> Key: SPARK-12870
> URL: https://issues.apache.org/jira/browse/SPARK-12870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12485) Rename "dynamic allocation" to "elastic scaling"

2016-01-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12485.
---
Resolution: Won't Fix

I talked to Andrew more offline. Looks like this name isn't so bad that we have 
to change it. Let's just keep it for now. Thanks.


> Rename "dynamic allocation" to "elastic scaling"
> 
>
> Key: SPARK-12485
> URL: https://issues.apache.org/jira/browse/SPARK-12485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> Fewer syllables, sounds more natural.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12908) Add tests to make sure that ml.classification.LogisticRegression returns meaningful result when labels are the same without intercept

2016-01-19 Thread DB Tsai (JIRA)
DB Tsai created SPARK-12908:
---

 Summary: Add tests to make sure that 
ml.classification.LogisticRegression returns meaningful result when labels are 
the same without intercept
 Key: SPARK-12908
 URL: https://issues.apache.org/jira/browse/SPARK-12908
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.6.0
Reporter: DB Tsai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode

2016-01-19 Thread John Vines (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107061#comment-15107061
 ] 

John Vines commented on SPARK-12650:


SPARK_SUBMIT_OPTS seems to work. -Xmx256m changed the heap settings for 
SparkSubmitJob, but left the driver alone and did not appear to cause the same 
conflict in the executors as mentioned above. I also did not see any logging 
about that setting (unlike SPARK_JAVA_OPTS which I mentioned above).

> No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
> -
>
> Key: SPARK-12650
> URL: https://issues.apache.org/jira/browse/SPARK-12650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.2
> Environment: Hadoop 2.6.0
>Reporter: John Vines
>
> Background-
> I have an app master designed to do some work and then launch a spark job.
> Issue-
> If I use yarn-cluster, then the SparkSubmit JVM is not given any Xmx setting at 
> all, so it takes a relatively large default heap. This 
> causes a large amount of vmem to be taken, so the container is killed by YARN. This 
> can be worked around by disabling YARN's vmem check, but that is a hack.
> If I run it in yarn-client mode, it's fine as long as my container has enough 
> space for the driver, which is manageable. But I feel that the utter lack of 
> Xmx settings for what I believe is a very small jvm is a problem.
> I believe this was introduced with the fix for SPARK-3884



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-19 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107178#comment-15107178
 ] 

Wenchen Fan commented on SPARK-12783:
-

Hi [~babloo80], can you move `MyMap` and `TestCaseClass` to the top level (don't 
make them inner classes) and try again? I can't reproduce your failure locally...
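
For example, something like this at the top level of a file (the package name is taken from the stack trace below):

{code}
package collector

case class MyMap(map: Map[String, String])

case class TestCaseClass(a: String, b: String) {
  def toMyMap: MyMap = MyMap(Map(a -> b))
  def toStr: String = a
}
{code}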

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> staticinvoke(class 
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
> scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> 

[jira] [Updated] (SPARK-12908) Add tests to make sure that ml.classification.LogisticRegression returns meaningful result when labels are the same without intercept

2016-01-19 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-12908:

Description: This will only add new tests, as a followup PR to 
https://github.com/apache/spark/pull/10743

> Add tests to make sure that ml.classification.LogisticRegression returns 
> meaningful result when labels are the same without intercept
> -
>
> Key: SPARK-12908
> URL: https://issues.apache.org/jira/browse/SPARK-12908
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: DB Tsai
>
> This will only add new tests, as a followup PR to 
> https://github.com/apache/spark/pull/10743



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12883) 1.6 Dynamic allocation document for removing executors with cached data differs in different sections

2016-01-19 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107085#comment-15107085
 ] 

Saisai Shao commented on SPARK-12883:
-

I get your point now. But I think these two descriptions are still both valid: 
the first paragraph describes what happens when an executor with cached data is 
removed, and the second paragraph explains how to work around the problem. Maybe 
it is just a different understanding from different people.
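
For what it's worth, a minimal sketch of the workaround mentioned in the 
description (raising the cached-executor idle timeout setting so such executors 
can eventually be reclaimed); the values here are illustrative, not 
recommendations:

{code}
import org.apache.spark.SparkConf

// Allow dynamic allocation to remove executors holding cached data
// after they have been idle for 10 minutes (default is infinity).
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
{code}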

> 1.6 Dynamic allocation document for removing executors with cached data 
> differs in different sections
> -
>
> Key: SPARK-12883
> URL: https://issues.apache.org/jira/browse/SPARK-12883
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Manoj Samel
>Priority: Trivial
>
> Spark 1.6 dynamic allocation documentation still refers to 1.2. 
> See text "There is currently not yet a solution for this in Spark 1.2. In 
> future releases, the cached data may be preserved through an off-heap storage 
> similar in spirit to how shuffle files are preserved through the external 
> shuffle service"
> It appears 1.6 has a parameter to address executors with cached data, 
> spark.dynamicAllocation.cachedExecutorIdleTimeout, with a default value of 
> infinity.
> Please update the 1.6 documentation to refer to the latest release and 
> features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12883) 1.6 Dynamic allocation document for removing executors with cached data differs in different sections

2016-01-19 Thread Manoj Samel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107057#comment-15107057
 ] 

Manoj Samel commented on SPARK-12883:
-

Updated the JIRA subject to reflect the issue more accurately.

> 1.6 Dynamic allocation document for removing executors with cached data 
> differs in different sections
> -
>
> Key: SPARK-12883
> URL: https://issues.apache.org/jira/browse/SPARK-12883
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Manoj Samel
>Priority: Trivial
>
> Spark 1.6 dynamic allocation documentation still refers to 1.2. 
> See text "There is currently not yet a solution for this in Spark 1.2. In 
> future releases, the cached data may be preserved through an off-heap storage 
> similar in spirit to how shuffle files are preserved through the external 
> shuffle service"
> It appears 1.6 has a parameter to address executors with cached data, 
> spark.dynamicAllocation.cachedExecutorIdleTimeout, with a default value of 
> infinity.
> Please update the 1.6 documentation to refer to the latest release and 
> features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12826) Spark Workers do not attempt reconnect or exit on connection failure.

2016-01-19 Thread Alan Braithwaite (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Braithwaite updated SPARK-12826:
-
Priority: Critical  (was: Major)

> Spark Workers do not attempt reconnect or exit on connection failure.
> -
>
> Key: SPARK-12826
> URL: https://issues.apache.org/jira/browse/SPARK-12826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Alan Braithwaite
>Priority: Critical
>
> Spark version 1.6.0 Hadoop 2.6.0 CDH 5.4.2
> We're running behind a tcp proxy (10.14.12.11:7077 is the tcp proxy listen 
> address in the example, upstreaming to the spark master listening on 9682 and 
> a different IP)
> To reproduce, I started a spark worker, let it successfully connect to the 
> master through the proxy, then tcpkill'd the connection on the Worker.  
> Nothing is logged from the code handling reconnection attempts.
> {code}
> 16/01/14 18:23:30 INFO Worker: Connecting to master 
> spark-master.example.com:7077...
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Creating new connection to 
> spark-master.example.com/10.14.12.11:7077
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Connection to 
> spark-master.example.com/10.14.12.11:7077 successful, running bootstraps...
> 16/01/14 18:23:30 DEBUG TransportClientFactory: Successfully created 
> connection to spark-master.example.com/10.14.12.11:7077 after 1 ms (0 ms 
> spent in bootstraps)
> 16/01/14 18:23:30 DEBUG Recycler: -Dio.netty.recycler.maxCapacity.default: 
> 262144
> 16/01/14 18:23:30 INFO Worker: Successfully registered with master 
> spark://0.0.0.0:9682
> 16/01/14 18:23:30 INFO Worker: Worker cleanup enabled; old application 
> directories will be deleted in: /var/lib/spark/work
> 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false 
> viewAcls=spark
> 16/01/14 18:36:52 DEBUG SecurityManager: user=null aclsEnabled=false 
> viewAcls=spark
> 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false 
> viewAcls=spark
> 16/01/14 18:36:57 DEBUG SecurityManager: user=null aclsEnabled=false 
> viewAcls=spark
> 16/01/14 18:41:31 WARN TransportChannelHandler: Exception in connection from 
> spark-master.example.com/10.14.12.11:7077
> java.io.IOException: Connection reset by peer
>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>   at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>   at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
>   at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:745)
> -- nothing more is logged, going on 15 minutes --
> $ ag -C5 Disconn 
> core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
> 313registrationRetryTimer.foreach(_.cancel(true))
> 314registrationRetryTimer = None
> 315  }
> 316
> 317  private def registerWithMaster() {
> 318// onDisconnected may be triggered multiple times, so don't attempt 
> registration
> 319// if there are outstanding registration attempts scheduled.
> 320registrationRetryTimer match {
> 321  case None =>
> 322registered = false
> 323registerMasterFutures = tryRegisterAllMasters()
> --
> 549finishedExecutors.values.toList, drivers.values.toList,
> 550finishedDrivers.values.toList, activeMasterUrl, cores, memory,
> 551coresUsed, memoryUsed, activeMasterWebUiUrl))
> 552  }
> 553
> 554  override def onDisconnected(remoteAddress: RpcAddress): Unit = {
> 555if (master.exists(_.address == remoteAddress)) {
> 556  logInfo(s"$remoteAddress Disassociated !")
> 557  masterDisconnected()
> 558}
> 559  }
> 560
> 561  private def masterDisconnected() {
> 562logError("Connection to master failed! Waiting for master to 
> reconnect...")
> 563connected = false
> 564registerWithMaster()
> 565  }
> 

[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM

2016-01-19 Thread Nong Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107258#comment-15107258
 ] 

Nong Li commented on SPARK-12546:
-

A better workaround might be to configure the max number of concurrent output 
files to 1. This can be done by setting 
"spark.sql.sources.maxConcurrentWrites=1".

> Writing to partitioned parquet table can fail with OOM
> --
>
> Key: SPARK-12546
> URL: https://issues.apache.org/jira/browse/SPARK-12546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Nong Li
>
> It is possible to have jobs fail with OOM when writing to a partitioned 
> parquet table. While this was probably always possible, it is more likely in 
> 1.6 due to the memory manager changes. The unified memory manager enables 
> Spark to use more of the process memory (in particular, for execution) which 
> gets us in this state more often. This issue can happen for libraries that 
> consume a lot of memory, such as parquet. Prior to 1.6, these libraries would 
> more likely use memory that spark was not using (i.e. from the storage pool). 
> In 1.6, this storage memory can now be used for execution.
> There are a couple of configs that can help with this issue.
>   - parquet.memory.pool.ratio: This is a parquet config for how much of the 
> heap the parquet writers should use. It defaults to 0.95. Consider a much 
> lower value (e.g. 0.1).
>   - spark.memory.fraction: This is a spark config to control how much of the 
> memory should be allocated to spark. Consider setting this to 0.6.
> This should cause jobs to potentially spill more but require less memory. 
> More aggressive tuning will control this trade-off.
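
A sketch of how the two configs above could be applied (values are illustrative; 
spark.memory.fraction must be set before the SparkContext starts, and 
parquet.memory.pool.ratio is a Parquet property read from the Hadoop 
configuration, so how it is picked up can depend on the write path):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Give Spark a smaller share of the heap so libraries like parquet
// have more room of their own.
val conf = new SparkConf()
  .setAppName("partitioned-parquet-write")
  .set("spark.memory.fraction", "0.6")
val sc = new SparkContext(conf)

// Tell the parquet writers to use a smaller fraction of the heap.
sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.1")
{code}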



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12867) Nullability of Intersect can be stricter

2016-01-19 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-12867.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10812
[https://github.com/apache/spark/pull/10812]

> Nullability of Intersect can be stricter
> 
>
> Key: SPARK-12867
> URL: https://issues.apache.org/jira/browse/SPARK-12867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{Intersect}} doesn't override {{SetOperation.output}}, which is defined as:
> {code}
>   override def output: Seq[Attribute] =
> left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
>   leftAttr.withNullability(leftAttr.nullable || rightAttr.nullable)
> }
> {code}
> However, we can replace the {{||}} with {{&&}} for intersection.
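
For reference, a sketch of the stricter override suggested above, mirroring the 
quoted snippet with {{&&}} in place of {{||}} (intended for the {{Intersect}} 
operator; shown as a fragment, not a standalone program):

{code}
  // An attribute of the intersection is nullable only if it is
  // nullable on both sides.
  override def output: Seq[Attribute] =
    left.output.zip(right.output).map { case (leftAttr, rightAttr) =>
      leftAttr.withNullability(leftAttr.nullable && rightAttr.nullable)
    }
{code}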



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12909) Spark on Mesos accessing Secured HDFS w/Kerberos

2016-01-19 Thread Greg Senia (JIRA)
Greg Senia created SPARK-12909:
--

 Summary: Spark on Mesos accessing Secured HDFS w/Kerberos
 Key: SPARK-12909
 URL: https://issues.apache.org/jira/browse/SPARK-12909
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Reporter: Greg Senia


Ability for Spark on Mesos to use a Kerberized HDFS FileSystem for data. It 
seems like this is not possible based on email chains and forum articles. If 
that is true, how hard would it be to get this implemented? I'm willing to try 
to help.

https://community.hortonworks.com/questions/5415/spark-on-yarn-vs-mesos.html

https://www.mail-archive.com/user@spark.apache.org/msg31326.html





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12797) Aggregation without grouping keys

2016-01-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12797:


Assignee: Apache Spark

> Aggregation without grouping keys
> -
>
> Key: SPARK-12797
> URL: https://issues.apache.org/jira/browse/SPARK-12797
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12770) Implement rules for branch elimination for CaseWhen in SimplifyConditionals

2016-01-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12770.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Implement rules for branch elimination for CaseWhen in SimplifyConditionals
> ---
>
> Key: SPARK-12770
> URL: https://issues.apache.org/jira/browse/SPARK-12770
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> There are a few things we can do:
> 1. If the first branch's condition is a true literal, remove the CaseWhen and 
> use the value from that branch.
> 2. If a branch's condition is a false or null literal, remove that branch.
> 3. If only the else branch is left, remove the CaseWhen and use the value 
> from the else branch.
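
A self-contained toy sketch of the three eliminations listed above (the 
{{Expr}}/{{Lit}}/{{CaseWhen}} classes here are simplified stand-ins for 
illustration, not the actual Catalyst expressions used by SimplifyConditionals):

{code}
sealed trait Expr
case class Lit(value: Any) extends Expr
case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

object SimplifyCaseWhen {
  private val TrueLit = Lit(true)
  private def isFalseOrNull(e: Expr): Boolean = e == Lit(false) || e == Lit(null)

  def simplify(e: Expr): Expr = e match {
    // Rule 1: the first branch's condition is a true literal -> use its value.
    case CaseWhen((TrueLit, value) +: _, _) => value
    // Rule 2: drop branches whose condition is a false or null literal.
    case CaseWhen(branches, elseValue) if branches.exists(b => isFalseOrNull(b._1)) =>
      simplify(CaseWhen(branches.filterNot(b => isFalseOrNull(b._1)), elseValue))
    // Rule 3: only the else branch is left -> use its value (null if absent).
    case CaseWhen(Seq(), elseValue) => elseValue.getOrElse(Lit(null))
    case other => other
  }
}

// Example: the false branch is dropped, then the true branch wins:
// SimplifyCaseWhen.simplify(
//   CaseWhen(Seq((Lit(false), Lit(1)), (Lit(true), Lit(2))), None)) == Lit(2)
{code}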



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12168) Need test for conflicted function in R

2016-01-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12168:
--
Assignee: Felix Cheung

> Need test for conflicted function in R
> --
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently it is hard to know if a function in the base or stats packages is 
> masked when adding a new function in SparkR.
> Having an automated test would make it easier to track such changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12168) Need test for conflicted function in R

2016-01-19 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12168.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10171
[https://github.com/apache/spark/pull/10171]

> Need test for conflicted function in R
> --
>
> Key: SPARK-12168
> URL: https://issues.apache.org/jira/browse/SPARK-12168
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently it is hard to know if a function in the base or stats packages is 
> masked when adding a new function in SparkR.
> Having an automated test would make it easier to track such changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12898) Consider having dummyCallSite for HiveTableScan

2016-01-19 Thread Rajesh Balamohan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated SPARK-12898:
-
Attachment: callsiteProf.png

> Consider having dummyCallSite for HiveTableScan
> ---
>
> Key: SPARK-12898
> URL: https://issues.apache.org/jira/browse/SPARK-12898
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
> Attachments: callsiteProf.png
>
>
> Currently, it runs with getCallSite, which is really expensive and shows up 
> when scanning through large tables with partitions (e.g. TPC-DS). It would be 
> good to consider having dummyCallSite in HiveTableScan.
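
A sketch only, assuming the existing {{Utils.withDummyCallSite}} helper that the 
data source scan path already uses is applicable here; the member names 
({{hadoopReader}}, {{relation}}) refer to what HiveTableScan holds internally 
and are used on that assumption:

{code}
import org.apache.spark.util.Utils

// Inside HiveTableScan: avoid computing the real call site for every scan
// by wrapping the RDD construction with a dummy call site.
val rdd = Utils.withDummyCallSite(sc) {
  hadoopReader.makeRDDForTable(relation.hiveQlTable)
}
{code}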



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12864) initialize executorIdCounter after ApplicationMaster killed for max number of executor failures reached

2016-01-19 Thread iward (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107859#comment-15107859
 ] 

iward commented on SPARK-12864:
---

The important point of the idea is to fix this conflicting executor id. As the 
task log shows, if the shuffle data for the current task is not found, it throws 
a FetchFailedException. So I think the mechanism in the AM-restarted case is to 
continue running rather than recompute; if the computed data is not found, it 
throws a FetchFailedException. I have also run a test showing that it continues 
to run normally.


>  initialize executorIdCounter after ApplicationMaster killed for max number 
> of executor failures reached
> 
>
> Key: SPARK-12864
> URL: https://issues.apache.org/jira/browse/SPARK-12864
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
>Reporter: iward
>
> Currently, when the number of executor failures reaches 
> *maxNumExecutorFailures*, the *ApplicationMaster* is killed and another one is 
> re-registered. This time, a new instance of *YarnAllocator* is created.
> But the value of the property *executorIdCounter* in *YarnAllocator* resets to 
> *0*, so the *Id* of each new executor starts from 1 again. This conflicts with 
> executors that have already been created, which causes a 
> FetchFailedException.
> For example, the following is the task log:
> {noformat}
> 2015-12-22 02:33:15 INFO 15/12/22 02:33:15 WARN 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has 
> disassociated: 172.22.92.14:45125
> 2015-12-22 02:33:26 INFO 15/12/22 02:33:26 INFO 
> YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkYarnAM@172.22.168.72:54040/user/YarnAM#-1290854604])
> {noformat}
> {noformat}
> 2015-12-22 02:35:02 INFO 15/12/22 02:35:02 INFO YarnClientSchedulerBackend: 
> Registered executor: 
> AkkaRpcEndpointRef(Actor[akka.tcp://sparkexecu...@bjhc-hera-16217.hadoop.jd.local:46538/user/Executor#-790726793])
>  with ID 1
> {noformat}
> {noformat}
> Lost task 3.0 in stage 102.0 (TID 1963, BJHC-HERA-16217.hadoop.jd.local): 
> FetchFailed(BlockManagerId(1, BJHC-HERA-17030.hadoop.jd.local, 7337
> ), shuffleId=5, mapId=2, reduceId=3, message=
> 2015-12-22 02:43:20 INFO org.apache.spark.shuffle.FetchFailedException: 
> /data3/yarn1/local/usercache/dd_edw/appcache/application_1450438154359_206399/blockmgr-b1fd0363-6d53-4d09-8086-adc4a13f4dc4/0f/shuffl
> e_5_2_0.index (No such file or directory)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:84)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:154)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:149)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-22 02:43:20 INFO at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> {noformat}
> As the task log shows, the executor id of *BJHC-HERA-16217.hadoop.jd.local* 
> is the same as that of *BJHC-HERA-17030.hadoop.jd.local*. So it is confusing 
> and causes a FetchFailedException.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12669) Organize options for default values

2016-01-19 Thread Mohit Jaggi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15107887#comment-15107887
 ] 

Mohit Jaggi commented on SPARK-12669:
-

hmm... wouldn't it be good to have a typesafe API as well, in addition to this 
one? It could be a utility on top of this API. Maps are a bit hard to use: you 
don't get auto-completion from IDEs, no compile-time checks, etc.
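
A purely hypothetical sketch of what such a typesafe utility could look like on 
top of a Map-based option API (the {{CsvOptions}} name and fields are made up 
for illustration; nothing like this exists in Spark today):

{code}
// Typed option holder: the compiler checks the option names and the IDE
// can auto-complete them.
case class CsvOptions(
    nullValue: String = "",
    nanValue: String = "NaN",
    naValue: String = "N/A") {

  // Flatten to the stringly-typed map the underlying data source expects.
  def toMap: Map[String, String] = Map(
    "nullValue" -> nullValue,
    "nanValue"  -> nanValue,
    "naValue"   -> naValue)
}

// Usage: pass CsvOptions(nullValue = "NULL").toMap wherever the Map-based
// API takes its options.
{code}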

> Organize options for default values
> ---
>
> Key: SPARK-12669
> URL: https://issues.apache.org/jira/browse/SPARK-12669
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> CSV data source in SparkSQL should be able to differentiate empty string, 
> null, NaN, “N/A” (maybe data type dependent).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12913) Reimplement all builtin aggregate functions as declarative function

2016-01-19 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12913:
--

 Summary: Reimplement all builtin aggregate functions as 
declarative function
 Key: SPARK-12913
 URL: https://issues.apache.org/jira/browse/SPARK-12913
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu


As benchmarked and discussed here: 
https://github.com/apache/spark/pull/10786/files#r50038294.

Benefiting from codegen, a declarative aggregate function can be much faster 
than an imperative one, so we should re-implement all the builtin aggregate 
functions as declarative ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


