[jira] [Commented] (SPARK-16296) add null check for key when create map data in encoder

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355130#comment-15355130
 ] 

Apache Spark commented on SPARK-16296:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13974

> add null check for key when create map data in encoder
> --
>
> Key: SPARK-16296
> URL: https://issues.apache.org/jira/browse/SPARK-16296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-16296) add null check for key when create map data in encoder

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16296:


Assignee: Wenchen Fan  (was: Apache Spark)

> add null check for key when create map data in encoder
> --
>
> Key: SPARK-16296
> URL: https://issues.apache.org/jira/browse/SPARK-16296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-16296) add null check for key when create map data in encoder

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16296:


Assignee: Apache Spark  (was: Wenchen Fan)

> add null check for key when create map data in encoder
> --
>
> Key: SPARK-16296
> URL: https://issues.apache.org/jira/browse/SPARK-16296
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16297:


Assignee: (was: Apache Spark)

> Mapping Boolean and string  to BIT and NVARCHAR(MAX) for SQL Server jdbc 
> dialect
> 
>
> Key: SPARK-16297
> URL: https://issues.apache.org/jira/browse/SPARK-16297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oussama Mekni
>  Labels: patch
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Tested with SQLServer 2012 and SQLServer Express:
> - Fix mapping of StringType to NVARCHAR(MAX)
> - Fix mapping of BooleanType to BIT
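
A minimal Scala sketch of the kind of mapping this issue describes (this is not the
patch in the linked pull request; the dialect name, the URL prefix check, and the
registration call shown here are illustrative assumptions):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

// Hypothetical SQL Server dialect: send StringType as NVARCHAR(MAX) and
// BooleanType as BIT instead of the generic defaults.
object MsSqlServerDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("NVARCHAR(MAX)", Types.NVARCHAR))
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None  // fall back to the built-in mappings
  }
}

// Register the dialect before writing DataFrames through a SQL Server JDBC URL.
JdbcDialects.registerDialect(MsSqlServerDialectSketch)
{code}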






[jira] [Commented] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355227#comment-15355227
 ] 

Apache Spark commented on SPARK-16297:
--

User 'meknio' has created a pull request for this issue:
https://github.com/apache/spark/pull/13944

> Mapping Boolean and string  to BIT and NVARCHAR(MAX) for SQL Server jdbc 
> dialect
> 
>
> Key: SPARK-16297
> URL: https://issues.apache.org/jira/browse/SPARK-16297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oussama Mekni
>  Labels: patch
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Tested with SQLServer 2012 and SQLServer Express:
> - Fix mapping of StringType to NVARCHAR(MAX)
> - Fix mapping of BooleanType to BIT






[jira] [Assigned] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16297:


Assignee: Apache Spark

> Mapping Boolean and string  to BIT and NVARCHAR(MAX) for SQL Server jdbc 
> dialect
> 
>
> Key: SPARK-16297
> URL: https://issues.apache.org/jira/browse/SPARK-16297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oussama Mekni
>Assignee: Apache Spark
>  Labels: patch
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Tested with SQLServer 2012 and SQLServer Express:
> - Fix mapping of StringType to NVARCHAR(MAX)
> - Fix mapping of BooleanType to BIT






[jira] [Commented] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355462#comment-15355462
 ] 

Apache Spark commented on SPARK-16299:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/13975

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Running the SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to the daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor and forks R workers from the daemon when necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, R's 
> cleanup procedure deletes that temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created under 
> the directory are removed along with it. All future R workers forked from the 
> daemon are affected too: if they use tempdir() or tempfile() to obtain temporary 
> files, they will fail to create them under the already-deleted session temporary 
> directory.
> So for the daemon mode to work, this problem has to be circumvented. In the 
> current daemon.R, R workers exit directly, skipping R's cleanup procedure, so 
> that the shared temporary directory is not deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, daemon.R has a bug: when an execution error occurs in an R worker, R's 
> error handling eventually falls through to the cleanup procedure anyway. try() 
> should therefore be used in daemon.R to catch any error in the R workers, so 
> that they still exit directly.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}






[jira] [Assigned] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16299:


Assignee: Apache Spark

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> Running the SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to the daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor and forks R workers from the daemon when necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, R's 
> cleanup procedure deletes that temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created under 
> the directory are removed along with it. All future R workers forked from the 
> daemon are affected too: if they use tempdir() or tempfile() to obtain temporary 
> files, they will fail to create them under the already-deleted session temporary 
> directory.
> So for the daemon mode to work, this problem has to be circumvented. In the 
> current daemon.R, R workers exit directly, skipping R's cleanup procedure, so 
> that the shared temporary directory is not deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, daemon.R has a bug: when an execution error occurs in an R worker, R's 
> error handling eventually falls through to the cleanup procedure anyway. try() 
> should therefore be used in daemon.R to catch any error in the R workers, so 
> that they still exit directly.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}






[jira] [Assigned] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16299:


Assignee: (was: Apache Spark)

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Running the SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to the daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor and forks R workers from the daemon when necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, R's 
> cleanup procedure deletes that temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created under 
> the directory are removed along with it. All future R workers forked from the 
> daemon are affected too: if they use tempdir() or tempfile() to obtain temporary 
> files, they will fail to create them under the already-deleted session temporary 
> directory.
> So for the daemon mode to work, this problem has to be circumvented. In the 
> current daemon.R, R workers exit directly, skipping R's cleanup procedure, so 
> that the shared temporary directory is not deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, daemon.R has a bug: when an execution error occurs in an R worker, R's 
> error handling eventually falls through to the cleanup procedure anyway. try() 
> should therefore be used in daemon.R to catch any error in the R workers, so 
> that they still exit directly.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}









[jira] [Assigned] (SPARK-16288) Implement inline table generating function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16288:


Assignee: (was: Apache Spark)

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Assigned] (SPARK-16288) Implement inline table generating function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16288:


Assignee: Apache Spark

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-16288) Implement inline table generating function

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355560#comment-15355560
 ] 

Apache Spark commented on SPARK-16288:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13976

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Commented] (SPARK-16301) Analyzer rule for resolving using joins should respect case sensitivity setting

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355596#comment-15355596
 ] 

Apache Spark commented on SPARK-16301:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/13977

> Analyzer rule for resolving using joins should respect case sensitivity 
> setting
> ---
>
> Key: SPARK-16301
> URL: https://issues.apache.org/jira/browse/SPARK-16301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Quick repro: Passes on Spark 1.6.x, but fails on 2.0
> {code}
> case class MyColumn(userId: Int, field: String)
> val ds = Seq(MyColumn(1, "a")).toDF
> ds.join(ds, Seq("userid"))
> {code}
> {code}
> stacktrace:
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:313)
>   at scala.None$.get(Option.scala:311)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$88.apply(Analyzer.scala:1844)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$88.apply(Analyzer.scala:1844)
> {code}






[jira] [Commented] (SPARK-16256) Add Structured Streaming Programming Guide

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355629#comment-15355629
 ] 

Apache Spark commented on SPARK-16256:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13978

> Add Structured Streaming Programming Guide
> --
>
> Key: SPARK-16256
> URL: https://issues.apache.org/jira/browse/SPARK-16256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Assigned] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16302:


Assignee: Apache Spark

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>
> query: val df = Seq[(Int, Int)]().toDF("key", "value").count always uses 
> defaultParallelism tasks, so it can end up running many empty or tiny tasks.
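
A quick way to see the behavior (a spark-shell sketch; it assumes a SparkSession
named spark and that the local-collection read still goes through
sparkContext.parallelize with its default parallelism):

{code}
import spark.implicits._

val df = Seq[(Int, Int)]().toDF("key", "value")
// Even though the local collection is empty, the scan is split into roughly
// defaultParallelism partitions, so count() schedules that many (mostly empty) tasks.
println(df.rdd.getNumPartitions)
println(df.count())   // 0
{code}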






[jira] [Commented] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355662#comment-15355662
 ] 

Apache Spark commented on SPARK-16302:
--

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/13979

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> query: val df = Seq[(Int, Int)]().toDF("key", "value").count always uses 
> defaultParallelism tasks, so it can end up running many empty or tiny tasks.






[jira] [Assigned] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16302:


Assignee: (was: Apache Spark)

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> query: val df = Seq[(Int, Int)]().toDF("key", "value").count always uses 
> defaultParallelism tasks, so it can end up running many empty or tiny tasks.






[jira] [Assigned] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16198:


Assignee: (was: Apache Spark)

> Change the access level of the predict method in spark.ml.Predictor to public
> -
>
> Key: SPARK-16198
> URL: https://issues.apache.org/jira/browse/SPARK-16198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hussein Hazimeh
>Priority: Minor
>  Labels: latency, performance
>
> h1. Summary
> The transform method of predictors in spark.ml has a relatively high latency 
> when predicting single instances or small batches, which is mainly due to the 
> overhead introduced by DataFrame operations. For a text classification task 
> on the RCV1 dataset, changing the access level of the low-level "predict" 
> method from protected to public and using it to make predictions reduced the 
> latency of single predictions three- to four-fold and that of batches by 
> 50%. While the transform method is flexible and sufficient for general usage, 
> exposing the low-level predict method to the public API can benefit many 
> applications that require a low-latency response.
> h1. Experiment
> I performed an experiment to measure the latency of single instance 
> predictions in Spark and some other popular ML toolkits. Specifically, I'm 
> looking at the time it takes to predict or classify a feature vector 
> residing in memory after the model is trained.
> For each toolkit in the table below, logistic regression was trained on the 
> Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
> stored in LIBSVM format along with binary labels. Then the wall-clock time 
> required to classify each document in a sample of 100,000 documents is 
> measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
> reported. 
> All toolkits were tested on a desktop machine with an i7-6700 processor and 
> 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 
> 80ns for Python and 20ns for Scala.
> h1. Results
> The table below shows the latency of predictions for single instances in 
> milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
> 2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
> (Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
> access level of the predict method from protected to public and used it to 
> perform the predictions instead of transform. 
> ||Toolkit||API||P50||P90||P99||Max||
> |Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
> |{color:blue}Spark 2 (Modified){color}|{color:blue}ML (Scala){color}|0.0004|0.0031|0.0087|0.3979|
> |{color:blue}Spark (Modified){color}|{color:blue}ML (Scala){color}|0.0013|0.0061|0.0632|0.4924|
> |Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
> |Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
> |LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
> |{color:red}Spark{color}|{color:red}ML (Scala){color}|2.3713|2.9577|4.6521|511.2448|
> |{color:red}Spark 2{color}|{color:red}ML (Scala){color}|8.4603|9.4352|13.2143|292.8733|
> |BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
> |BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|
> The results show that spark.mllib has the lowest latency among all other 
> toolkits and APIs, and this can be attributed to its low-level prediction 
> function that operates directly on the feature vector. However, spark.ml has 
> a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 
> 10ms for Spark 2.0.0. Profiling the transform method of logistic regression 
> in spark.ml showed that only 0.01% of the time is being spent in doing the 
> dot product and logit transformation, while the rest of the time is dominated 
> by the DataFrame operations (mostly the “withColumn” operation that appends 
> the predictions column(s) to the input DataFrame). The results of the 
> modified versions of spark.ml, which directly use the predict method, 
> validate this observation, as the latency is reduced three- to four-fold.
> Since Spark splits batch predictions into a series of single-instance 
> predictions, reducing the latency of single predictions can lead to lower 
> latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) 
> using testing_features.map(x => model.predict( x)).collect() instead of 
> model.transform(testing_dataframe).select(“prediction”).collect(), and the 
> former had roughly 50% less latency for batches of size 1000, 10,000, and 
> 100,000.
> Although the experiment is constrained to logistic regression, other 
> predictors in the classification, regression, and clustering modules can 
> suffer from the same problem a
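
The two prediction paths being compared above, as a Scala sketch (it assumes the
access-level change this ticket proposes, a trained spark.ml model in model, and
pre-built test data testingDataframe: DataFrame and testingFeatures: RDD[Vector]):

{code}
// DataFrame path: per-row cost is dominated by DataFrame operations such as the
// withColumn call that appends the prediction column.
val viaTransform =
  model.transform(testingDataframe).select("prediction").collect()

// Low-level path: call predict() directly on each feature vector, skipping the
// DataFrame machinery. This only compiles if predict is made public, which is
// exactly what this ticket asks for.
val viaPredict = testingFeatures.map(x => model.predict(x)).collect()
{code}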

[jira] [Assigned] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16198:


Assignee: Apache Spark

> Change the access level of the predict method in spark.ml.Predictor to public
> -
>
> Key: SPARK-16198
> URL: https://issues.apache.org/jira/browse/SPARK-16198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hussein Hazimeh
>Assignee: Apache Spark
>Priority: Minor
>  Labels: latency, performance
>
> h1. Summary
> The transform method of predictors in spark.ml has a relatively high latency 
> when predicting single instances or small batches, which is mainly due to the 
> overhead introduced by DataFrame operations. For a text classification task 
> on the RCV1 dataset, changing the access level of the low-level "predict" 
> method from protected to public and using it to make predictions reduced the 
> latency of single predictions three- to four-fold and that of batches by 
> 50%. While the transform method is flexible and sufficient for general usage, 
> exposing the low-level predict method to the public API can benefit many 
> applications that require a low-latency response.
> h1. Experiment
> I performed an experiment to measure the latency of single instance 
> predictions in Spark and some other popular ML toolkits. Specifically, I'm 
> looking at the time it takes to predict or classify a feature vector 
> residing in memory after the model is trained.
> For each toolkit in the table below, logistic regression was trained on the 
> Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
> stored in LIBSVM format along with binary labels. Then the wall-clock time 
> required to classify each document in a sample of 100,000 documents is 
> measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
> reported. 
> All toolkits were tested on a desktop machine with an i7-6700 processor and 
> 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 
> 80ns for Python and 20ns for Scala.
> h1. Results
> The table below shows the latency of predictions for single instances in 
> milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
> 2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
> (Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
> access level of the predict method from protected to public and used it to 
> perform the predictions instead of transform. 
> ||Toolkit||API||P50||P90||P99||Max||
> |Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
> |{color:blue}Spark 2 (Modified){color}|{color:blue}ML (Scala){color}|0.0004|0.0031|0.0087|0.3979|
> |{color:blue}Spark (Modified){color}|{color:blue}ML (Scala){color}|0.0013|0.0061|0.0632|0.4924|
> |Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
> |Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
> |LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
> |{color:red}Spark{color}|{color:red}ML (Scala){color}|2.3713|2.9577|4.6521|511.2448|
> |{color:red}Spark 2{color}|{color:red}ML (Scala){color}|8.4603|9.4352|13.2143|292.8733|
> |BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
> |BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|
> The results show that spark.mllib has the lowest latency among all other 
> toolkits and APIs, and this can be attributed to its low-level prediction 
> function that operates directly on the feature vector. However, spark.ml has 
> a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 
> 10ms for Spark 2.0.0. Profiling the transform method of logistic regression 
> in spark.ml showed that only 0.01% of the time is being spent in doing the 
> dot product and logit transformation, while the rest of the time is dominated 
> by the DataFrame operations (mostly the “withColumn” operation that appends 
> the predictions column(s) to the input DataFrame). The results of the 
> modified versions of spark.ml, which directly use the predict method, 
> validate this observation, as the latency is reduced three- to four-fold.
> Since Spark splits batch predictions into a series of single-instance 
> predictions, reducing the latency of single predictions can lead to lower 
> latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) 
> using testing_features.map(x => model.predict( x)).collect() instead of 
> model.transform(testing_dataframe).select(“prediction”).collect(), and the 
> former had roughly 50% less latency for batches of size 1000, 10,000, and 
> 100,000.
> Although the experiment is constrained to logistic regression, other 
> predictors in the classification, regression, and clustering modules can 
> suffe

[jira] [Commented] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355687#comment-15355687
 ] 

Apache Spark commented on SPARK-16198:
--

User 'husseinhazimeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/13980

> Change the access level of the predict method in spark.ml.Predictor to public
> -
>
> Key: SPARK-16198
> URL: https://issues.apache.org/jira/browse/SPARK-16198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hussein Hazimeh
>Priority: Minor
>  Labels: latency, performance
>
> h1. Summary
> The transform method of predictors in spark.ml has a relatively high latency 
> when predicting single instances or small batches, which is mainly due to the 
> overhead introduced by DataFrame operations. For a text classification task 
> on the RCV1 dataset, changing the access level of the low-level "predict" 
> method from protected to public and using it to make predictions reduced the 
> latency of single predictions three- to four-fold and that of batches by 
> 50%. While the transform method is flexible and sufficient for general usage, 
> exposing the low-level predict method to the public API can benefit many 
> applications that require a low-latency response.
> h1. Experiment
> I performed an experiment to measure the latency of single instance 
> predictions in Spark and some other popular ML toolkits. Specifically, I'm 
> looking at the time it takes to predict or classify a feature vector 
> residing in memory after the model is trained.
> For each toolkit in the table below, logistic regression was trained on the 
> Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
> stored in LIBSVM format along with binary labels. Then the wall-clock time 
> required to classify each document in a sample of 100,000 documents is 
> measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
> reported. 
> All toolkits were tested on a desktop machine with an i7-6700 processor and 
> 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 
> 80ns for Python and 20ns for Scala.
> h1. Results
> The table below shows the latency of predictions for single instances in 
> milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
> 2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
> (Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
> access level of the predict method from protected to public and used it to 
> perform the predictions instead of transform. 
> ||Toolkit||API||P50||P90||P99||Max||
> |Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
> |{color:blue}Spark 2 (Modified){color}|{color:blue}ML (Scala){color}|0.0004|0.0031|0.0087|0.3979|
> |{color:blue}Spark (Modified){color}|{color:blue}ML (Scala){color}|0.0013|0.0061|0.0632|0.4924|
> |Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
> |Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
> |LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
> |{color:red}Spark{color}|{color:red}ML (Scala){color}|2.3713|2.9577|4.6521|511.2448|
> |{color:red}Spark 2{color}|{color:red}ML (Scala){color}|8.4603|9.4352|13.2143|292.8733|
> |BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
> |BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|
> The results show that spark.mllib has the lowest latency among all other 
> toolkits and APIs, and this can be attributed to its low-level prediction 
> function that operates directly on the feature vector. However, spark.ml has 
> a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 
> 10ms for Spark 2.0.0. Profiling the transform method of logistic regression 
> in spark.ml showed that only 0.01% of the time is being spent in doing the 
> dot product and logit transformation, while the rest of the time is dominated 
> by the DataFrame operations (mostly the “withColumn” operation that appends 
> the predictions column(s) to the input DataFrame). The results of the 
> modified versions of spark.ml, which directly use the predict method, 
> validate this observation, as the latency is reduced three- to four-fold.
> Since Spark splits batch predictions into a series of single-instance 
> predictions, reducing the latency of single predictions can lead to lower 
> latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) 
> using testing_features.map(x => model.predict( x)).collect() instead of 
> model.transform(testing_dataframe).select(“prediction”).collect(), and the 
> former had roughly 50% less latency for batches of size 1000, 10,000, and 
> 100,000.
> Although the experiment is constrained to logistic regres

[jira] [Commented] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355896#comment-15355896
 ] 

Apache Spark commented on SPARK-16307:
--

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/13981

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16307
> URL: https://issues.apache.org/jira/browse/SPARK-16307
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the variance returned equals 
> the variance that we can manually verify on toy data and a toy tree.
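
A toy illustration of the proposed check (a sketch, not the test added in the pull
request): compute the variance of a small label set by hand and compare it with the
impurity the tree reports.

{code}
// Hand-computed (population) variance of a toy label set. A test in the spirit of
// this ticket would train a variance-impurity regression tree on these labels and
// assert that the root node's reported impurity equals this value.
val labels = Seq(1.0, 2.0, 3.0, 4.0)
val mean = labels.sum / labels.size
val manualVariance = labels.map(x => math.pow(x - mean, 2)).sum / labels.size
// manualVariance == 1.25 for this data
{code}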






[jira] [Assigned] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16307:


Assignee: Apache Spark

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16307
> URL: https://issues.apache.org/jira/browse/SPARK-16307
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the variance returned equals 
> the variance that we can manually verify on toy data and a toy tree.






[jira] [Assigned] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16307:


Assignee: (was: Apache Spark)

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16307
> URL: https://issues.apache.org/jira/browse/SPARK-16307
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the variance returned equals 
> the variance that we can manually verify on toy data and a toy tree.






[jira] [Assigned] (SPARK-16304) LinkageError should not crash Spark executor

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16304:


Assignee: Apache Spark

> LinkageError should not crash Spark executor
> 
>
> Key: SPARK-16304
> URL: https://issues.apache.org/jira/browse/SPARK-16304
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> If we have a linkage error in the user code, Spark executors get killed 
> immediately. This is not great for user experience.
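
A sketch of why a LinkageError is so disruptive (based on standard Scala semantics,
not a quote of Spark's executor code): scala.util.control.NonFatal treats
LinkageError as fatal, so handlers written as case NonFatal(e) let it escape to the
uncaught-exception handler.

{code}
import scala.util.control.NonFatal

try {
  // Stand-in for a user-code linkage failure (NoClassDefFoundError extends LinkageError).
  throw new NoClassDefFoundError("com/example/Missing")
} catch {
  case NonFatal(e)     => println(s"handled as non-fatal: $e")    // not reached
  case e: LinkageError => println(s"needs an explicit case: $e")  // reached
}
{code}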






[jira] [Commented] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355951#comment-15355951
 ] 

Apache Spark commented on SPARK-16021:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/13983

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to 
> put it behind a configuration flag and enable it only in tests.
> cc [~sameerag] [~hvanhovell]
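
A conceptual sketch of the idea (not Spark's memory manager API): fill memory with a
sentinel byte at free time so that any later read of "freed" memory shows up as
obviously wrong values. The sentinel value and helper below are illustrative
assumptions.

{code}
// Overwrite a "freed" buffer with a sentinel so stale reads become visible;
// a real implementation would guard this behind a test-only configuration flag.
val Sentinel: Byte = 0xa5.toByte

def freeForTest(buf: java.nio.ByteBuffer): Unit = {
  var i = 0
  while (i < buf.capacity()) {
    buf.put(i, Sentinel)
    i += 1
  }
  // ...then return the buffer to the pool or allocator as usual.
}
{code}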






[jira] [Commented] (SPARK-16304) LinkageError should not crash Spark executor

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355949#comment-15355949
 ] 

Apache Spark commented on SPARK-16304:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/13982

> LinkageError should not crash Spark executor
> 
>
> Key: SPARK-16304
> URL: https://issues.apache.org/jira/browse/SPARK-16304
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>
> If we have a linkage error in the user code, Spark executors get killed 
> immediately. This is not great for user experience.






[jira] [Assigned] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16021:


Assignee: Apache Spark

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to 
> put it behind a configuration flag and enable it only in tests.
> cc [~sameerag] [~hvanhovell]









[jira] [Assigned] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16310:


Assignee: Apache Spark

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16310
> URL: https://issues.apache.org/jira/browse/SPARK-16310
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16310:


Assignee: (was: Apache Spark)

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16310
> URL: https://issues.apache.org/jira/browse/SPARK-16310
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355985#comment-15355985
 ] 

Apache Spark commented on SPARK-16310:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/13984

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16310
> URL: https://issues.apache.org/jira/browse/SPARK-16310
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16114) Add network word count example

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356255#comment-15356255
 ] 

Apache Spark commented on SPARK-16114:
--

User 'jjthomas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13957

> Add network word count example
> --
>
> Key: SPARK-16114
> URL: https://issues.apache.org/jira/browse/SPARK-16114
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: James Thomas
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356364#comment-15356364
 ] 

Apache Spark commented on SPARK-16313:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13987

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16313:


Assignee: Apache Spark  (was: Reynold Xin)

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16313:


Assignee: Reynold Xin  (was: Apache Spark)

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16101:


Assignee: Apache Spark

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently, the CSV data source has a structure quite different from the JSON 
> data source, although the two could be largely similar.
> It would be great if they shared a similar structure so that common issues 
> could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16101:


Assignee: (was: Apache Spark)

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source has a structure quite different from the JSON 
> data source, although the two could be largely similar.
> It would be great if they shared a similar structure so that common issues 
> could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356393#comment-15356393
 ] 

Apache Spark commented on SPARK-16101:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/13988

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source has a structure quite different from the JSON 
> data source, although the two could be largely similar.
> It would be great if they shared a similar structure so that common issues 
> could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356516#comment-15356516
 ] 

Apache Spark commented on SPARK-16311:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/13989

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing for users to see a 
> FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException stating that a workaround is 
> to do an explicit metadata refresh.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.
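A rough sketch of the user-facing flow this describes; {{Dataset.refresh()}} is only the proposed method from point (3), and the other calls assume a Spark 2.0-style {{SparkSession}} named {{spark}}.

{code}
// Hedged illustration of the proposal above, not a description of the final API.
val df = spark.read.parquet("/data/events")
df.createOrReplaceTempView("events_view")

// (2) refresh metadata when the underlying files change
spark.catalog.refreshTable("events_view")

// (3) the proposed Dataset-level refresh (not an existing method at the time of this ticket)
// df.refresh()
{code}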



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16311:


Assignee: (was: Apache Spark)

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing for users to see a 
> FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException stating that a workaround is 
> to do an explicit metadata refresh.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16311:


Assignee: Apache Spark

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> When the underlying file changes, it can be very confusing for users to see a 
> FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException stating that a workaround is 
> to do an explicit metadata refresh.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16287) Implement str_to_map SQL function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16287:


Assignee: Apache Spark

> Implement str_to_map SQL function
> -
>
> Key: SPARK-16287
> URL: https://issues.apache.org/jira/browse/SPARK-16287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16287) Implement str_to_map SQL function

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356524#comment-15356524
 ] 

Apache Spark commented on SPARK-16287:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/13990

> Implement str_to_map SQL function
> -
>
> Key: SPARK-16287
> URL: https://issues.apache.org/jira/browse/SPARK-16287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16287) Implement str_to_map SQL function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16287:


Assignee: (was: Apache Spark)

> Implement str_to_map SQL function
> -
>
> Key: SPARK-16287
> URL: https://issues.apache.org/jira/browse/SPARK-16287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16318) xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16318:


Assignee: (was: Apache Spark)

> xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string
> ---
>
> Key: SPARK-16318
> URL: https://issues.apache.org/jira/browse/SPARK-16318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16318) xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356629#comment-15356629
 ] 

Apache Spark commented on SPARK-16318:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/13991

> xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string
> ---
>
> Key: SPARK-16318
> URL: https://issues.apache.org/jira/browse/SPARK-16318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16318) xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16318:


Assignee: Apache Spark

> xpath_int, xpath_short, xpath_long, xpath_float, xpath_double, xpath_string
> ---
>
> Key: SPARK-16318
> URL: https://issues.apache.org/jira/browse/SPARK-16318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356704#comment-15356704
 ] 

Apache Spark commented on SPARK-12177:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13992

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>Assignee: Cody Koeninger
>  Labels: consumer, kafka
> Fix For: 2.0.0
>
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their existing Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16144:


Assignee: Apache Spark  (was: Yanbo Liang)

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> After we grouped generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16144:


Assignee: Yanbo Liang  (was: Apache Spark)

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356742#comment-15356742
 ] 

Apache Spark commented on SPARK-16144:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/13993

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356930#comment-15356930
 ] 

Apache Spark commented on SPARK-12177:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13996

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>Assignee: Cody Koeninger
>  Labels: consumer, kafka
> Fix For: 2.0.0
>
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their existing Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16328) Implement conversion utility functions for single instances in Python

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16328:


Assignee: Apache Spark  (was: Nick Pentreath)

> Implement conversion utility functions for single instances in Python
> -
>
> Key: SPARK-16328
> URL: https://issues.apache.org/jira/browse/SPARK-16328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>
> We have {{asML}}/{{fromML}} utility methods in Scala/Java to convert between 
> the old and new linalg types. These are missing in Python.
> For dense vectors it's easy to do without them, e.g. {{mlDenseVector = 
> Vectors.dense(mllibDenseVector)}}, but for sparse vectors it doesn't work as 
> easily. So it would be good to have utility methods available for users.
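For reference, a sketch of the Scala-side conversions whose Python counterparts are missing; this assumes the Spark 2.0 {{asML}}/{{fromML}} helpers mentioned above live on the {{mllib.linalg}} vectors and the {{Vectors}} factory.

{code}
import org.apache.spark.mllib.{linalg => mllib}
import org.apache.spark.ml.{linalg => ml}

// Old-style mllib vectors...
val denseOld  = mllib.Vectors.dense(1.0, 0.0, 3.0)
val sparseOld = mllib.Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// ...converted to new ml vectors, and back again.
val denseNew: ml.Vector     = denseOld.asML
val sparseNew: ml.Vector    = sparseOld.asML
val roundTrip: mllib.Vector = mllib.Vectors.fromML(denseNew)
{code}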



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16328) Implement conversion utility functions for single instances in Python

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16328:


Assignee: Nick Pentreath  (was: Apache Spark)

> Implement conversion utility functions for single instances in Python
> -
>
> Key: SPARK-16328
> URL: https://issues.apache.org/jira/browse/SPARK-16328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> We have {{asML}}/{{fromML}} utility methods in Scala/Java to convert between 
> the old and new linalg types. These are missing in Python.
> For dense vectors it's easy to do without them, e.g. {{mlDenseVector = 
> Vectors.dense(mllibDenseVector)}}, but for sparse vectors it doesn't work as 
> easily. So it would be good to have utility methods available for users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16328) Implement conversion utility functions for single instances in Python

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357130#comment-15357130
 ] 

Apache Spark commented on SPARK-16328:
--

User 'MLnick' has created a pull request for this issue:
https://github.com/apache/spark/pull/13997

> Implement conversion utility functions for single instances in Python
> -
>
> Key: SPARK-16328
> URL: https://issues.apache.org/jira/browse/SPARK-16328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> We have {{asML}}/{{fromML}} utility methods in Scala/Java to convert between 
> the old and new linalg types. These are missing in Python.
> For dense vectors it's easy to do without them, e.g. {{mlDenseVector = 
> Vectors.dense(mllibDenseVector)}}, but for sparse vectors it doesn't work as 
> easily. So it would be good to have utility methods available for users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.10 Consumer API

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357480#comment-15357480
 ] 

Apache Spark commented on SPARK-12177:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/13998

> Update KafkaDStreams to new Kafka 0.10 Consumer API
> ---
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>Assignee: Cody Koeninger
>  Labels: consumer, kafka
> Fix For: 2.0.0
>
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one, so I added the new consumer API as separate 
> classes in the package org.apache.spark.streaming.kafka.v09. I did not remove 
> the old classes, for backward compatibility: users will not need to change 
> their existing Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16281) Implement parse_url SQL function

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16281:


Assignee: Apache Spark

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16281) Implement parse_url SQL function

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357501#comment-15357501
 ] 

Apache Spark commented on SPARK-16281:
--

User 'janplus' has created a pull request for this issue:
https://github.com/apache/spark/pull/13999

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16281) Implement parse_url SQL function

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16281:


Assignee: (was: Apache Spark)

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16331:


Assignee: Apache Spark

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>Assignee: Apache Spark
>
> During code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 {{UnsafeRow}} elements.
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> In {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector, 
> since {{Vector}} is Traversable. This part significantly increases code 
> generation time.
> This patch avoids this overhead by checking the number of children before 
> iterating over all elements; {{LocalRelation}} does not have children since it 
> extends {{LeafNode}}.
> The performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> filter+reduce             4426 / 4533          0.2         4426.0        1.0X
> 
> with this patch
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> filter+reduce             3117 / 3391          0.3         3116.6        1.0X
> {quote}
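A tiny self-contained sketch of the kind of guard described above; the tree classes are illustrative, not Catalyst's actual {{TreeNode}}.

{code}
// Illustrative only: skip the traversal entirely when a node is a leaf,
// so a leaf's (potentially huge) payload is never scanned.
object GuardSketch {
  sealed trait Node { def children: Seq[Node] }
  case class Leaf(data: Vector[Int]) extends Node { val children: Seq[Node] = Nil }
  case class Branch(children: Seq[Node]) extends Node

  def transformChildren(n: Node, f: Node => Node): Node =
    if (n.children.isEmpty) n            // leaf: nothing to transform, return as-is
    else Branch(n.children.map(f))
}
{code}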



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16331:


Assignee: (was: Apache Spark)

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>
> During code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 {{UnsafeRow}} elements.
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> In {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector, 
> since {{Vector}} is Traversable. This part significantly increases code 
> generation time.
> This patch avoids this overhead by checking the number of children before 
> iterating over all elements; {{LocalRelation}} does not have children since it 
> extends {{LeafNode}}.
> The performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> filter+reduce             4426 / 4533          0.2         4426.0        1.0X
> 
> with this patch
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> filter+reduce             3117 / 3391          0.3         3116.6        1.0X
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16331) [SQL] Reduce code generation time

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357519#comment-15357519
 ] 

Apache Spark commented on SPARK-16331:
--

User 'inouehrs' has created a pull request for this issue:
https://github.com/apache/spark/pull/14000

> [SQL] Reduce code generation time 
> --
>
> Key: SPARK-16331
> URL: https://issues.apache.org/jira/browse/SPARK-16331
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Hiroshi Inoue
>
> During code generation, a {{LocalRelation}} often has a huge {{Vector}} 
> object as {{data}}. In the simple example below, a {{LocalRelation}} has a 
> Vector with 100 {{UnsafeRow}} elements.
> {quote}
> val numRows = 100
> val ds = (1 to numRows).toDS().persist()
> benchmark.addCase("filter+reduce") { iter =>
>   ds.filter(a => (a & 1) == 0).reduce(_ + _)
> }
> {quote}
> In {{TreeNode.transformChildren}}, all elements of the vector are 
> unnecessarily iterated to check whether any children exist in the vector, 
> since {{Vector}} is Traversable. This part significantly increases code 
> generation time.
> This patch avoids this overhead by checking the number of children before 
> iterating over all elements; {{LocalRelation}} does not have children since it 
> extends {{LeafNode}}.
> The performance of the above example:
> {quote}
> without this patch
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
> Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> filter+reduce             4426 / 4533          0.2         4426.0        1.0X
> 
> with this patch
> compilationTime:    Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
> filter+reduce             3117 / 3391          0.3         3116.6        1.0X
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16256) Add Structured Streaming Programming Guide

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357825#comment-15357825
 ] 

Apache Spark commented on SPARK-16256:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/14001

> Add Structured Streaming Programming Guide
> --
>
> Key: SPARK-16256
> URL: https://issues.apache.org/jira/browse/SPARK-16256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16335:


Assignee: Reynold Xin  (was: Apache Spark)

> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357876#comment-15357876
 ] 

Apache Spark commented on SPARK-16335:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14002

> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16335) Streaming source should fail if file does not exist

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16335:


Assignee: Apache Spark  (was: Reynold Xin)

> Streaming source should fail if file does not exist
> ---
>
> Key: SPARK-16335
> URL: https://issues.apache.org/jira/browse/SPARK-16335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16336:


Assignee: (was: Apache Spark)

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357926#comment-15357926
 ] 

Apache Spark commented on SPARK-16336:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/14003

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16336) Suggest doing table refresh when encountering FileNotFoundException at runtime

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16336:


Assignee: Apache Spark

> Suggest doing table refresh when encountering FileNotFoundException at runtime
> --
>
> Key: SPARK-16336
> URL: https://issues.apache.org/jira/browse/SPARK-16336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16285) Implement sentences SQL function

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16285:


Assignee: Apache Spark

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16285) Implement sentences SQL function

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357958#comment-15357958
 ] 

Apache Spark commented on SPARK-16285:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14004

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16285) Implement sentences SQL function

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16285:


Assignee: (was: Apache Spark)

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15954) TestHive has issues being used in PySpark

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358008#comment-15358008
 ] 

Apache Spark commented on SPARK-15954:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14005

> TestHive has issues being used in PySpark
> -
>
> Key: SPARK-15954
> URL: https://issues.apache.org/jira/browse/SPARK-15954
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: holdenk
>
> SPARK-15745 made TestHive unreliable from PySpark test cases; to support them, 
> we should allow either resource-based or system-property-based lookup for 
> loading the hive file.
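A minimal sketch of the kind of dual lookup suggested above; the helper name and property key are hypothetical, not Spark's actual code.

{code}
import java.io.{File, FileNotFoundException}
import java.net.URL

object TestHiveFiles {
  // Illustrative lookup order: classpath resource first, then a system-property fallback.
  def locate(name: String): URL =
    Option(getClass.getClassLoader.getResource(name))
      .orElse(sys.props.get("spark.test.home")                 // hypothetical property key
        .map(home => new File(home, name).toURI.toURL))
      .getOrElse(throw new FileNotFoundException(name))
}
{code}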



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13015) Replace example code in mllib-data-types.md using include_example

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358028#comment-15358028
 ] 

Apache Spark commented on SPARK-13015:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14006

> Replace example code in mllib-data-types.md using include_example
> -
>
> Key: SPARK-13015
> URL: https://issues.apache.org/jira/browse/SPARK-13015
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
>
> The example code in the user guide is embedded in the markdown and hence is 
> not easy to test. It would be nice to test it automatically. This JIRA is to 
> discuss options for automating example code testing and see what we can do in 
> Spark 1.6.
> The goal is to move the actual example code to spark/examples and test 
> compilation in Jenkins builds. Then, in the markdown, we can reference part of 
> the code to show in the user guide. This requires adding a Jekyll tag similar 
> to https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/LocalVectorExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/LocalVectorExample.scala`
> and pick the code blocks marked "example", replacing the code block inside 
> {code}{% highlight %}{code}
> in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337
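A sketch of what such an example file could look like; the {{// $example on$}} / {{// $example off$}} markers are an assumption about the marker syntax, not something specified in this ticket.

{code}
package org.apache.spark.examples.mllib

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object LocalVectorExample {
  def main(args: Array[String]): Unit = {
    // $example on$
    // Only the region between the markers would be pulled into the user guide.
    val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
    val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
    // $example off$
    println(s"dense = $dense, sparse = $sparse")
  }
}
{code}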



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16329:


Assignee: (was: Apache Spark)

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
> at $iwC$$iwC$$iwC.<init>(<console>:41)
> at $iwC$$iwC.<init>(<console>:43)
> at $iwC.<init>(<console>:45)
> at <init>(<console>:47)
> at .<init>(<console>:51)
> at .<clinit>(<console>)
> at .<init>(<console>:7)
> at .<clinit>(<console>)
> at $print(<console>)
>

[jira] [Commented] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358306#comment-15358306
 ] 

Apache Spark commented on SPARK-16329:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14007

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>
> The following works with spark 1.5.1, but not anymore with spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> 

[jira] [Assigned] (SPARK-16329) select * from temp_table_no_cols fails

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16329:


Assignee: Apache Spark

> select * from temp_table_no_cols fails
> --
>
> Key: SPARK-16329
> URL: https://issues.apache.org/jira/browse/SPARK-16329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Adrian Ionescu
>Assignee: Apache Spark
>
> The following works with Spark 1.5.1, but no longer works with Spark 1.6.0:
> {code}
> import org.apache.spark.sql.{ DataFrame, Row }
> import org.apache.spark.sql.types.StructType
> val rddNoCols = sqlContext.sparkContext.parallelize(1 to 10).map(_ => 
> Row.empty)
> val dfNoCols = sqlContext.createDataFrame(rddNoCols, StructType(Seq.empty))
> dfNoCols.registerTempTable("temp_table_no_cols")
> sqlContext.sql("select * from temp_table_no_cols").show
> {code}
> spark 1.5.1 result:
> {noformat}
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> {noformat}
> spark 1.6.0 result:
> {noformat}
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.require(Predef.scala:221)
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedStar.expand(unresolved.scala:199)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:354)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10$$anonfun$applyOrElse$14.apply(Analyzer.scala:353)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:353)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$10.applyOrElse(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:347)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:328)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
> at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
> at scala.collection.immutable.List.foldLeft(List.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
> at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:133)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:28)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:33)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:35)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:37)
> at $iwC$$iwC$$iwC$$iwC.(:39)
> at $iwC$$iwC$$iwC.(:41)
> at $iwC$$iwC.(:43)
> at $iwC.(:45)
> at (:47)
> at .(:51)
> at .()
> at .(:7)
> at .()
> 

[jira] [Commented] (SPARK-16281) Implement parse_url SQL function

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358423#comment-15358423
 ] 

Apache Spark commented on SPARK-16281:
--

User 'janplus' has created a pull request for this issue:
https://github.com/apache/spark/pull/14008

> Implement parse_url SQL function
> 
>
> Key: SPARK-16281
> URL: https://issues.apache.org/jira/browse/SPARK-16281
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16311) Improve metadata refresh

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358456#comment-15358456
 ] 

Apache Spark commented on SPARK-16311:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14009

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException stating that the workaround is 
> an explicit metadata refresh.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.
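
A hedged sketch of how (1)-(3) fit together from user code; the table name, path, and the {{df.refresh()}} call are illustrative only ({{df.refresh()}} is the proposed API, not an existing one).

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("refresh-sketch").getOrCreate()

// Existing workaround for catalog tables: explicitly refresh the cached file
// listing after the underlying files have been rewritten by another job.
spark.catalog.refreshTable("events")          // "events" is an illustrative table name

// Items (2) and (3) ask for the same capability on temp views and directly on
// Datasets/DataFrames, e.g. a hypothetical method on the Dataset itself:
val df = spark.read.parquet("/data/events")   // illustrative path
// df.refresh()                               // proposed API, sketch only
{code}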



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16339) ScriptTransform does not print stderr when outstream is lost

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16339:


Assignee: Apache Spark

> ScriptTransform does not print stderr when outstream is lost
> 
>
> Key: SPARK-16339
> URL: https://issues.apache.org/jira/browse/SPARK-16339
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, if the `outstream` gets destroyed or closed due to some failure, a 
> later `outstream.close()` call throws an `IOException`, as in this case: 
> https://github.com/apache/spark/blob/4f869f88ee96fa57be79f972f218111b6feac67f/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala#L325
> Due to this, the `stderrBuffer` does not get logged and there is no way for 
> users to see why the job failed.
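
A hedged sketch of the general fix pattern in plain Scala (not the actual ScriptTransformation code): report the captured stderr in a finally block so an IOException from close() cannot swallow it.

{code}
import java.io.{ByteArrayOutputStream, OutputStream}

def closeAndReportStderr(outstream: OutputStream,
                         stderrBuffer: ByteArrayOutputStream): Unit = {
  try {
    outstream.close()  // may throw IOException if the stream was already destroyed
  } finally {
    // The finally block guarantees the captured stderr is still reported,
    // so users can see why the child script failed.
    if (stderrBuffer.size() > 0) {
      System.err.println(s"Script stderr:\n${stderrBuffer.toString("UTF-8")}")
    }
  }
}
{code}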



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16339) ScriptTransform does not print stderr when outstream is lost

2016-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358473#comment-15358473
 ] 

Apache Spark commented on SPARK-16339:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/13834

> ScriptTransform does not print stderr when outstream is lost
> 
>
> Key: SPARK-16339
> URL: https://issues.apache.org/jira/browse/SPARK-16339
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Tejas Patil
>Priority: Trivial
>
> Currently, if the `outstream` gets destroyed or closed due to some failure, a 
> later `outstream.close()` call throws an `IOException`, as in this case: 
> https://github.com/apache/spark/blob/4f869f88ee96fa57be79f972f218111b6feac67f/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala#L325
> Due to this, the `stderrBuffer` does not get logged and there is no way for 
> users to see why the job failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16339) ScriptTransform does not print stderr when outstream is lost

2016-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16339:


Assignee: (was: Apache Spark)

> ScriptTransform does not print stderr when outstream is lost
> 
>
> Key: SPARK-16339
> URL: https://issues.apache.org/jira/browse/SPARK-16339
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Tejas Patil
>Priority: Trivial
>
> Currently, if the `outstream` gets destroyed or closed due to some failure, a 
> later `outstream.close()` call throws an `IOException`, as in this case: 
> https://github.com/apache/spark/blob/4f869f88ee96fa57be79f972f218111b6feac67f/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala#L325
> Due to this, the `stderrBuffer` does not get logged and there is no way for 
> users to see why the job failed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16343) Improve the PushDownPredicate rule to push down predicates correctly in non-deterministic condition

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358691#comment-15358691
 ] 

Apache Spark commented on SPARK-16343:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/14012

> Improve the PushDownPredicate rule to push down predicates correctly in 
> non-deterministic condition
> --
>
> Key: SPARK-16343
> URL: https://issues.apache.org/jira/browse/SPARK-16343
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Jiang Xingbo
>Priority: Critical
>
> Currently our Optimizer may reorder the predicates to run them more 
> efficiently, but when the condition is non-deterministic, changing the order 
> between its deterministic and non-deterministic parts may change the number of 
> input rows. For example:
> {code:sql}
> SELECT a FROM t WHERE rand() < 0.1 AND a = 1
> {code}
> And
> {code:sql}
> SELECT a FROM t WHERE a = 1 AND rand() < 0.1
> {code}
> may call rand() a different number of times and therefore produce different output rows.
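
A hedged illustration in plain Scala (not the Catalyst rule itself) of the constraint the improved rule has to respect: only the predicates that appear before the first non-deterministic one can safely be pushed down or reordered. {{Pred}} is a toy stand-in for Catalyst's {{Expression.deterministic}} flag.

{code}
// Toy model of a conjunction of filter predicates.
case class Pred(sql: String, deterministic: Boolean)

val conjuncts = Seq(
  Pred("rand() < 0.1", deterministic = false),
  Pred("a = 1",        deterministic = true)
)

// Everything before the first non-deterministic predicate may be pushed down;
// the rest must keep the order the user wrote.
val (pushable, mustStay) = conjuncts.span(_.deterministic)

println(s"pushable:  ${pushable.map(_.sql)}")  // List() -- nothing precedes rand()
println(s"must stay: ${mustStay.map(_.sql)}")  // List(rand() < 0.1, a = 1)
{code}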



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16343) Improve the PushDownPredicate rule to push down predicates correctly in non-deterministic condition

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16343:


Assignee: (was: Apache Spark)

> Improve the PushDownPredicate rule to push down predicates correctly in 
> non-deterministic condition
> --
>
> Key: SPARK-16343
> URL: https://issues.apache.org/jira/browse/SPARK-16343
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Jiang Xingbo
>Priority: Critical
>
> Currently our Optimizer may reorder the predicates to run them more 
> efficiently, but when the condition is non-deterministic, changing the order 
> between its deterministic and non-deterministic parts may change the number of 
> input rows. For example:
> {code:sql}
> SELECT a FROM t WHERE rand() < 0.1 AND a = 1
> {code}
> And
> {code:sql}
> SELECT a FROM t WHERE a = 1 AND rand() < 0.1
> {code}
> may call rand() a different number of times and therefore produce different output rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16343) Improve the PushDownPredicate rule to push down predicates correctly in non-deterministic condition

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16343:


Assignee: Apache Spark

> Improve the PushDownPredicate rule to push down predicates correctly in 
> non-deterministic condition
> --
>
> Key: SPARK-16343
> URL: https://issues.apache.org/jira/browse/SPARK-16343
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Critical
>
> Currently our Optimizer may reorder the predicates to run them more 
> efficiently, but when the condition is non-deterministic, changing the order 
> between its deterministic and non-deterministic parts may change the number of 
> input rows. For example:
> {code:sql}
> SELECT a FROM t WHERE rand() < 0.1 AND a = 1
> {code}
> And
> {code:sql}
> SELECT a FROM t WHERE a = 1 AND rand() < 0.1
> {code}
> may call rand() a different number of times and therefore produce different output rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358799#comment-15358799
 ] 

Apache Spark commented on SPARK-16344:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/14013

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  ||-- element: struct (containsNull = true)
> //  |||-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: Expected instance of group converter 
> but got 
> "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
> at 
> org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
> at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:266)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   
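
A hedged workaround sketch for data that has not been written yet ({{A2}}, {{B2}}, and the {{value}} field name are illustrative): avoid an inner struct field literally named "element", since it collides with the synthetic wrapper name used for array elements in the Parquet schema.

{code}
// Renaming the inner field sidesteps the ambiguous "element" level.
case class A2(value: Long)
case class B2(f: Array[A2])

val path2 = "/tmp/not_so_silly.parquet"
Seq(B2(Array(A2(42)))).toDF("f0").write.mode("overwrite").parquet(path2)
sqlContext.read.parquet(path2).show()
{code}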

[jira] [Assigned] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16344:


Assignee: Cheng Lian  (was: Apache Spark)

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  ||-- element: struct (containsNull = true)
> //  |||-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: Expected instance of group converter 
> but got 
> "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
> at 
> org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
> at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:266)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
> ... 26 more
> {noformat}
> Spark 2.0.0-SNAPSHOT and Spark master also suffer this issue. To reproduce 

[jira] [Assigned] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16344:


Assignee: Apache Spark  (was: Cheng Lian)

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  ||-- element: struct (containsNull = true)
> //  |||-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: Expected instance of group converter 
> but got 
> "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
> at 
> org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
> at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:266)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
> ... 26 more
> {noformat}
> Spark 2.0.0-SNAPSHOT and Spark master also suffer this issue. To reproduc

[jira] [Commented] (SPARK-16344) Array of struct with a single field name "element" can't be decoded from Parquet files written by Spark 1.6+

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358837#comment-15358837
 ] 

Apache Spark commented on SPARK-16344:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/14014

> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> 
>
> Key: SPARK-16344
> URL: https://issues.apache.org/jira/browse/SPARK-16344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Array of struct with a single field name "element" can't be decoded from 
> Parquet files written by Spark 1.6+
> The following Spark shell snippet for Spark 1.6 reproduces this bug:
> {code}
> case class A(element: Long)
> case class B(f: Array[A])
> val path = "/tmp/silly.parquet"
> Seq(B(Array(A(42)))).toDF("f0").write.mode("overwrite").parquet(path)
> val df = sqlContext.read.parquet(path)
> df.printSchema()
> // root
> //  |-- f0: array (nullable = true)
> //  ||-- element: struct (containsNull = true)
> //  |||-- element: long (nullable = true)
> df.show()
> {code}
> Exception thrown:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> file:/tmp/silly.parquet/part-r-7-e06db7b0-5181-4a14-9fee-5bb452e883a0.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:194)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassCastException: Expected instance of group converter 
> but got 
> "org.apache.spark.sql.execution.datasources.parquet.CatalystPrimitiveConverter"
> at 
> org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
> at 
> org.apache.parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:266)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:134)
> at 
> org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:99)
> at 
> org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:154)
> at 
> org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:99)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   

[jira] [Assigned] (SPARK-16345) Extract graphx programming guide example snippets from source files instead of hard code them

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16345:


Assignee: Apache Spark

> Extract graphx programming guide example snippets from source files instead 
> of hard code them
> -
>
> Key: SPARK-16345
> URL: https://issues.apache.org/jira/browse/SPARK-16345
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples, GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>
> Currently, all example snippets in the graphx programming guide are 
> hard-coded, which can be pretty hard to update and verify. In contrast, the 
> ML documentation pages use the include_example Jekyll plugin to extract 
> snippets from actual source files under the examples sub-project. In this 
> way, we can guarantee that Java and Scala code are compilable, and it would 
> be much easier to verify these example snippets since they are part of 
> complete Spark applications.
> A similar task is SPARK-7924.
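
A hedged sketch of what a source-backed snippet could look like under this approach (the file path, data file, and example region are illustrative): the example lives in the examples sub-project and the guide includes only the region between the {{$example on$}} / {{$example off$}} markers.

{code}
// examples/src/main/scala/org/apache/spark/examples/graphx/PageRankExample.scala (illustrative)
// $example on$
import org.apache.spark.graphx.GraphLoader

// Load an edge list and run PageRank; the guide pulls in exactly this region.
val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")
val ranks = graph.pageRank(0.0001).vertices
ranks.take(5).foreach(println)
// $example off$
{code}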



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16345) Extract graphx programming guide example snippets from source files instead of hard code them

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16345:


Assignee: (was: Apache Spark)

> Extract graphx programming guide example snippets from source files instead 
> of hard code them
> -
>
> Key: SPARK-16345
> URL: https://issues.apache.org/jira/browse/SPARK-16345
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples, GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Weichen Xu
>
> Currently, all example snippets in the graphx programming guide are 
> hard-coded, which can be pretty hard to update and verify. In contrast, the 
> ML documentation pages use the include_example Jekyll plugin to extract 
> snippets from actual source files under the examples sub-project. In this 
> way, we can guarantee that Java and Scala code are compilable, and it would 
> be much easier to verify these example snippets since they are part of 
> complete Spark applications.
> A similar task is SPARK-7924.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16345) Extract graphx programming guide example snippets from source files instead of hard code them

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15358844#comment-15358844
 ] 

Apache Spark commented on SPARK-16345:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/14015

> Extract graphx programming guide example snippets from source files instead 
> of hard code them
> -
>
> Key: SPARK-16345
> URL: https://issues.apache.org/jira/browse/SPARK-16345
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Examples, GraphX
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Weichen Xu
>
> Currently, all example snippets in the graphx programming guide are 
> hard-coded, which can be pretty hard to update and verify. In contrast, the 
> ML documentation pages use the include_example Jekyll plugin to extract 
> snippets from actual source files under the examples sub-project. In this 
> way, we can guarantee that Java and Scala code are compilable, and it would 
> be much easier to verify these example snippets since they are part of 
> complete Spark applications.
> A similar task is SPARK-7924.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15761) pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359334#comment-15359334
 ] 

Apache Spark commented on SPARK-15761:
--

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/14016

> pyspark shell should load if PYSPARK_DRIVER_PYTHON is ipython and Python3
> 
>
> Key: SPARK-15761
> URL: https://issues.apache.org/jira/browse/SPARK-15761
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Manoj Kumar
>Assignee: Manoj Kumar
>Priority: Minor
> Fix For: 1.6.3, 2.0.1
>
>
> My default python is ipython3 and it is odd that it fails with "IPython 
> requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16212) code cleanup of kafka-0-8 to match review feedback on 0-10

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359591#comment-15359591
 ] 

Apache Spark commented on SPARK-16212:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/14018

> code cleanup of kafka-0-8 to match review feedback on 0-10
> --
>
> Key: SPARK-16212
> URL: https://issues.apache.org/jira/browse/SPARK-16212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16223) Codegen failure with a Dataframe program using an array

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16223:


Assignee: Apache Spark

> Codegen failure with a Dataframe program using an array
> ---
>
> Key: SPARK-16223
> URL: https://issues.apache.org/jira/browse/SPARK-16223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>
> When we compile a DataFrame program with an operation on a large array, a 
> compilation failure occurs. This is because the local variable 
> {{inputadapter_value}} cannot be referenced in the {{apply()}} methods that are 
> generated by {{CodegenContext.splitExpressions()}}. The local variable is 
> defined in the {{processNext()}} method.
> What is a better approach to resolve this? Is it better to pass 
> {{inputadapter_value}} to the {{apply()}} methods?
> Example program
> {code}
> val n = 500
> val statement = (0 to n - 1).map(i => s"value + 1.0d")
>   .mkString("Array(", ",", ")")
> sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
>   .selectExpr(statement).showString(1, true)
> {code}
> Generated code and stack trace
> {code:java}
> 23:10:45.801 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 30, Column 36: Expression "inputadapter_value" is not 
> an rvalue
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator inputadapter_input;
> /* 008 */   private Object[] project_values;
> /* 009 */   private UnsafeRow project_result;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 011 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 012 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> project_arrayWriter;
> /* 013 */
> /* 014 */   public GeneratedIterator(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */   }
> /* 017 */
> /* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 019 */ partitionIndex = index;
> /* 020 */ inputadapter_input = inputs[0];
> /* 021 */ this.project_values = null;
> /* 022 */ project_result = new UnsafeRow(1);
> /* 023 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  32);
> /* 024 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  1);
> /* 025 */ this.project_arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 026 */   }
> /* 027 */
> /* 028 */   private void project_apply_0(InternalRow inputadapter_row) {
> /* 029 */ double project_value1 = -1.0;
> /* 030 */ project_value1 = inputadapter_value + 1.0D;
> /* 031 */ if (false) {
> /* 032 */   project_values[0] = null;
> /* 033 */ } else {
> /* 034 */   project_values[0] = project_value1;
> /* 035 */ }
> /* 036 */
> /* 037 */ double project_value4 = -1.0;
> /* 038 */ project_value4 = inputadapter_value + 1.0D;
> /* 039 */ if (false) {
> /* 040 */   project_values[1] = null;
> /* 041 */ } else {
> /* 042 */   project_values[1] = project_value4;
> /* 043 */ }
> ...
> /* 4032 */   }
> /* 4033 */
> /* 4034 */   protected void processNext() throws java.io.IOException {
> /* 4035 */ while (inputadapter_input.hasNext()) {
> /* 4036 */   InternalRow inputadapter_row = (InternalRow) 
> inputadapter_input.next();
> /* 4037 */   System.out.println("row: " + inputadapter_row.getClass() + 
> ", " + inputadapter_row);
> /* 4038 */   double inputadapter_value = inputadapter_row.getDouble(0);
> /* 4039 */
> /* 4040 */   final boolean project_isNull = false;
> /* 4041 */   this.project_values = new 
> Object[500];project_apply_0(inputadapter_row);
> /* 4042 */   project_apply_1(inputadapter_row);
> /* 4043 */   /* final ArrayData project_value = 
> org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); 
> */
> /* 4044 */   final ArrayData project_value = new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
> /* 4045 */   this.project_values = null;
> /* 4046 */   project_holder.reset();
> /* 4047 */
> /* 4048 */   project_rowWriter.zeroOutNullBytes();
> /* 4049 */
> /* 4050 */   if (project
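
A hedged, plain-Scala illustration of the scoping question raised above (all names are illustrative; this is not the generated code): a helper produced by splitting cannot see a local of the caller unless that value is passed in as a parameter.

{code}
// Fix direction: thread the needed local through as an explicit argument,
// instead of referencing it as if it were still in scope.
def applyPart0(out: Array[Double], inputValue: Double): Unit = {
  out(0) = inputValue + 1.0
}

def process(rows: Iterator[Array[Double]]): Iterator[Array[Double]] = rows.map { row =>
  val inputValue = row(0)       // local, analogous to inputadapter_value
  val out = new Array[Double](1)
  applyPart0(out, inputValue)   // compiles because the value is now a parameter
  out
}

println(process(Iterator(Array(0.0), Array(1.0))).map(_.mkString(",")).toList)
{code}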

[jira] [Assigned] (SPARK-16223) Codegen failure with a Dataframe program using an array

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16223:


Assignee: (was: Apache Spark)

> Codegen failure with a Dataframe program using an array
> ---
>
> Key: SPARK-16223
> URL: https://issues.apache.org/jira/browse/SPARK-16223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When we compile a DataFrame program with an operation on a large array, a 
> compilation failure occurs. This is because the local variable 
> {{inputadapter_value}} cannot be referenced in the {{apply()}} methods that are 
> generated by {{CodegenContext.splitExpressions()}}. The local variable is 
> defined in the {{processNext()}} method.
> What is a better approach to resolve this? Is it better to pass 
> {{inputadapter_value}} to the {{apply()}} methods?
> Example program
> {code}
> val n = 500
> val statement = (0 to n - 1).map(i => s"value + 1.0d")
>   .mkString("Array(", ",", ")")
> sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
>   .selectExpr(statement).showString(1, true)
> {code}
> Generated code and stack trace
> {code:java}
> 23:10:45.801 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 30, Column 36: Expression "inputadapter_value" is not 
> an rvalue
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator inputadapter_input;
> /* 008 */   private Object[] project_values;
> /* 009 */   private UnsafeRow project_result;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 011 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 012 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> project_arrayWriter;
> /* 013 */
> /* 014 */   public GeneratedIterator(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */   }
> /* 017 */
> /* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 019 */ partitionIndex = index;
> /* 020 */ inputadapter_input = inputs[0];
> /* 021 */ this.project_values = null;
> /* 022 */ project_result = new UnsafeRow(1);
> /* 023 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  32);
> /* 024 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  1);
> /* 025 */ this.project_arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 026 */   }
> /* 027 */
> /* 028 */   private void project_apply_0(InternalRow inputadapter_row) {
> /* 029 */ double project_value1 = -1.0;
> /* 030 */ project_value1 = inputadapter_value + 1.0D;
> /* 031 */ if (false) {
> /* 032 */   project_values[0] = null;
> /* 033 */ } else {
> /* 034 */   project_values[0] = project_value1;
> /* 035 */ }
> /* 036 */
> /* 037 */ double project_value4 = -1.0;
> /* 038 */ project_value4 = inputadapter_value + 1.0D;
> /* 039 */ if (false) {
> /* 040 */   project_values[1] = null;
> /* 041 */ } else {
> /* 042 */   project_values[1] = project_value4;
> /* 043 */ }
> ...
> /* 4032 */   }
> /* 4033 */
> /* 4034 */   protected void processNext() throws java.io.IOException {
> /* 4035 */ while (inputadapter_input.hasNext()) {
> /* 4036 */   InternalRow inputadapter_row = (InternalRow) 
> inputadapter_input.next();
> /* 4037 */   System.out.println("row: " + inputadapter_row.getClass() + 
> ", " + inputadapter_row);
> /* 4038 */   double inputadapter_value = inputadapter_row.getDouble(0);
> /* 4039 */
> /* 4040 */   final boolean project_isNull = false;
> /* 4041 */   this.project_values = new 
> Object[500];project_apply_0(inputadapter_row);
> /* 4042 */   project_apply_1(inputadapter_row);
> /* 4043 */   /* final ArrayData project_value = 
> org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); 
> */
> /* 4044 */   final ArrayData project_value = new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
> /* 4045 */   this.project_values = null;
> /* 4046 */   project_holder.reset();
> /* 4047 */
> /* 4048 */   project_rowWriter.zeroOutNullBytes();
> /* 4049 */
> /* 4050 */   if (project_isNull) {
> /* 4051 */  

[jira] [Commented] (SPARK-16223) Codegen failure with a Dataframe program using an array

2016-07-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359716#comment-15359716
 ] 

Apache Spark commented on SPARK-16223:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/14019

> Codegen failure with a Dataframe program using an array
> ---
>
> Key: SPARK-16223
> URL: https://issues.apache.org/jira/browse/SPARK-16223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kazuaki Ishizaki
>
> When we compile a DataFrame program with an operation on a large array, a 
> compilation failure occurs. This is because the local variable 
> {{inputadapter_value}} cannot be referenced in the {{apply()}} methods that are 
> generated by {{CodegenContext.splitExpressions()}}. The local variable is 
> defined in the {{processNext()}} method.
> What is a better approach to resolve this? Is it better to pass 
> {{inputadapter_value}} to the {{apply()}} methods?
> Example program
> {code}
> val n = 500
> val statement = (0 to n - 1).map(i => s"value + 1.0d")
>   .mkString("Array(", ",", ")")
> sparkContext.parallelize(Seq(0.0d, 1.0d), 1).toDF
>   .selectExpr(statement).showString(1, true)
> {code}
> Generated code and stack trace
> {code:java}
> 23:10:45.801 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 30, Column 36: Expression "inputadapter_value" is not 
> an rvalue
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator inputadapter_input;
> /* 008 */   private Object[] project_values;
> /* 009 */   private UnsafeRow project_result;
> /* 010 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder project_holder;
> /* 011 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> project_rowWriter;
> /* 012 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter 
> project_arrayWriter;
> /* 013 */
> /* 014 */   public GeneratedIterator(Object[] references) {
> /* 015 */ this.references = references;
> /* 016 */   }
> /* 017 */
> /* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
> /* 019 */ partitionIndex = index;
> /* 020 */ inputadapter_input = inputs[0];
> /* 021 */ this.project_values = null;
> /* 022 */ project_result = new UnsafeRow(1);
> /* 023 */ this.project_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(project_result,
>  32);
> /* 024 */ this.project_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(project_holder,
>  1);
> /* 025 */ this.project_arrayWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
> /* 026 */   }
> /* 027 */
> /* 028 */   private void project_apply_0(InternalRow inputadapter_row) {
> /* 029 */ double project_value1 = -1.0;
> /* 030 */ project_value1 = inputadapter_value + 1.0D;
> /* 031 */ if (false) {
> /* 032 */   project_values[0] = null;
> /* 033 */ } else {
> /* 034 */   project_values[0] = project_value1;
> /* 035 */ }
> /* 036 */
> /* 037 */ double project_value4 = -1.0;
> /* 038 */ project_value4 = inputadapter_value + 1.0D;
> /* 039 */ if (false) {
> /* 040 */   project_values[1] = null;
> /* 041 */ } else {
> /* 042 */   project_values[1] = project_value4;
> /* 043 */ }
> ...
> /* 4032 */   }
> /* 4033 */
> /* 4034 */   protected void processNext() throws java.io.IOException {
> /* 4035 */ while (inputadapter_input.hasNext()) {
> /* 4036 */   InternalRow inputadapter_row = (InternalRow) 
> inputadapter_input.next();
> /* 4037 */   System.out.println("row: " + inputadapter_row.getClass() + 
> ", " + inputadapter_row);
> /* 4038 */   double inputadapter_value = inputadapter_row.getDouble(0);
> /* 4039 */
> /* 4040 */   final boolean project_isNull = false;
> /* 4041 */   this.project_values = new 
> Object[500];project_apply_0(inputadapter_row);
> /* 4042 */   project_apply_1(inputadapter_row);
> /* 4043 */   /* final ArrayData project_value = 
> org.apache.spark.sql.catalyst.util.GenericArrayData.allocate(project_values); 
> */
> /* 4044 */   final ArrayData project_value = new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(project_values);
> /* 4045 */   this.project_values = null;
> /* 4046 */   project_holder.reset();
> /* 4047 */
> /* 

[jira] [Assigned] (SPARK-16233) test_sparkSQL.R is failing

2016-07-01 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16233:


Assignee: Apache Spark

> test_sparkSQL.R is failing
> --
>
> Key: SPARK-16233
> URL: https://issues.apache.org/jira/browse/SPARK-16233
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Assignee: Apache Spark
>Priority: Minor
>
> By running 
> {code}
> ./R/run-tests.sh 
> {code}
> I get the following error:
> {code}
> xin:spark xr$ ./R/run-tests.sh
> Warning: Ignoring non-spark config property: SPARK_SCALA_VERSION=2.11
> Loading required package: methods
> Attaching package: ‘SparkR’
> The following object is masked from ‘package:testthat’:
> describe
> The following objects are masked from ‘package:stats’:
> cov, filter, lag, na.omit, predict, sd, var, window
> The following objects are masked from ‘package:base’:
> as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
> rank, rbind, sample, startsWith, subset, summary, transform, union
> binary functions: ...
> functions on binary files: 
> broadcast variables: ..
> functions in client.R: .
> test functions in sparkR.R: .Re-using existing Spark Context. Call 
> sparkR.session.stop() or restart R to create a new Spark Context
> Re-using existing Spark Context. Call sparkR.session.stop() or restart R 
> to create a new Spark Context
> ...
> include an external JAR in SparkContext: Warning: Ignoring non-spark config 
> property: SPARK_SCALA_VERSION=2.11
> ..
> include R packages:
> MLlib functions: .SLF4J: Failed to load class 
> "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> .27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Maximum row group padding size is 0 bytes
> 27-Jun-2016 1:51:25 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 65,622
> 27-Jun-2016 1:51:25 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
> BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, RLE, 
> BIT_PACKED]
> 27-Jun-2016 1:51:25 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
> list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
> encodings: [PLAIN, RLE]
> 27-Jun-2016 1:51:25 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
> [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: 
> [PLAIN, BIT_PACKED]
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet block size to 134217728
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet page size to 1048576
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Parquet dictionary page size to 1048576
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Dictionary is on
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Validation is off
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Writer version is: PARQUET_1_0
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
> Maximum row group padding size is 0 bytes
> 27-Jun-2016 1:51:26 PM INFO: 
> org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem 
> columnStore to file. allocated memory: 49
> 27-Jun-2016 1:51:26 PM INFO: 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
> list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: 
> [PLAIN, RLE]
> 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
>
