[jira] [Resolved] (SPARK-16294) Labelling support for the include_example Jekyll plugin

2016-06-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16294.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13972
[https://github.com/apache/spark/pull/13972]

> Labelling support for the include_example Jekyll plugin
> ---
>
> Key: SPARK-16294
> URL: https://issues.apache.org/jira/browse/SPARK-16294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> Part of the Spark programming guide pages are using the {{include_example}} 
> Jekyll plugin to extract line blocks surrounded by {{example on}} and 
> {{example off}} tag pairs (usually written as comments) from source files 
> under the {{examples}} sub-project.
> One limitation of the {{include_example}} plugin is that all line blocks 
> within a single file must be included in a single example snippet block in 
> the final HTML file. It would be nice to add labelling support to this 
> plugin, so that we can mark different line blocks with different labels in the 
> source file:
> {code}
> // $example on:init_session$
> val spark = SparkSession
>   .builder()
>   .appName("Spark Examples")
>   .config("spark.some.config.option", "some-value")
>   .getOrCreate()
> // For implicit conversions like RDDs to DataFrames
> import spark.implicits._
> // $example off:init_session$
> // $example on:create_df$
> val df = spark.read.json("examples/src/main/resources/people.json")
> // Displays the content of the DataFrame to stdout
> df.show()
> // age  name
> // null Michael
> // 30   Andy
> // 19   Justin
> // $example off:create_df$
> {code}
> and then, by referring to different labels in the Liquid template, like this:
> {code}
> {% include_example init_session 
> scala/org/apache/spark/examples/sql/SparkSessionExample.scala %}
> {% include_example create_df 
> scala/org/apache/spark/examples/sql/SparkSessionExample.scala %}
> {code}
> we may generate multiple example snippet blocks in the final HTML page from a 
> single source file.
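
For illustration only, a rough sketch of the label-aware extraction being proposed (the real plugin is a Ruby Jekyll plugin; this Scala helper and its name are hypothetical):

{code}
// Sketch: given a source file's lines and a label, keep only the lines between
// "$example on:<label>$" and "$example off:<label>$" markers.
def extractLabeled(lines: Seq[String], label: String): Seq[String] = {
  val on  = s"$$example on:$label$$"    // e.g. "$example on:init_session$"
  val off = s"$$example off:$label$$"
  lines.foldLeft((Vector.empty[String], false)) {
    case ((acc, _), line) if line.contains(on)  => (acc, true)    // enter labeled block
    case ((acc, _), line) if line.contains(off) => (acc, false)   // leave labeled block
    case ((acc, true), line)                    => (acc :+ line, true)
    case ((acc, false), _)                      => (acc, false)
  }._1
}

// extractLabeled(sourceLines, "init_session") keeps only the SparkSession
// initialization lines, and extractLabeled(sourceLines, "create_df") keeps only
// the DataFrame creation lines, so one source file can feed several HTML snippets.
{code}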



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16290) text type features column for classification

2016-06-29 Thread mahendra singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356548#comment-15356548
 ] 

mahendra singh commented on SPARK-16290:


[~srowen] Hi srowen,
I have one issue with Spark regarding text-type features for Naive Bayes.
I have the following data:

Male , Suspicion of Alcohol , Weekday , 12 ,75 , 30-39 
Male , Moving Traffic Violation , Weekday , 12 , 20 ,20-24 
Male , Suspicion of Alcohol , Weekend , 4 , 12 , 40-49 
Male , Suspicion of Alcohol , Weekday , 12 , 0 , 50-59 
Female , Road Traffic Collision , Weekend , 12 , 0 , 20-24 
Male , Road Traffic Collision  , Weekday , 12 , 0 , 25-29 
Male , Road Traffic Collision , Weekday , 8 , 0 , Other 
Male , Road Traffic Collision , Weekday , 8 , 23 , 60-69
Male , Moving Traffic Violation  , Weekend , 4, 26, 30-39
Female , Road Traffic Collision , Weekend, 8 , 61, 16-19  
Male , Moving Traffic Violation , Weekend , 4 , 74 , 25-29 
Male , Road Traffic Collision , Weekday , 12, 0 , Other 
Male  , Moving Traffic Violation , Weekday , 8 , 0 , 16-19 
Male , Road Traffic Collision , Weekday , 8 , 0 , Other
Male , Moving Traffic Violation , Weekend , 4 , 0 ,30-39

In this data you can see that some columns (comma separated) are numeric and some 
contain text. Spark's Naive Bayes currently only supports numeric features, so how 
can I transform the text columns into numeric ones? The numeric value assigned to 
a given text value must be the same at training and testing time, otherwise it 
will cause problems. Is this possible with Spark today? I am asking because I did 
not find a solution for this. If it is possible, how? And if not, can this issue 
be solved?
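
A minimal sketch of one way to do this with spark.ml's StringIndexer (the column names below are made up for illustration): fit the pipeline once on the training data and reuse the fitted model on the test data, so the text-to-number mapping stays consistent across both.

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

import sqlContext.implicits._

// Hypothetical column names for the comma-separated data above.
val training = sc.parallelize(Seq(
  ("Male", "Suspicion of Alcohol", "Weekday", 12.0, 75.0, "30-39"),
  ("Female", "Road Traffic Collision", "Weekend", 12.0, 0.0, "20-24")
)).toDF("gender", "reason", "day", "hour", "amount", "ageBand")

// StringIndexer assigns a numeric index to every distinct string value.
val pipeline = new Pipeline().setStages(Array[PipelineStage](
  new StringIndexer().setInputCol("gender").setOutputCol("gender_idx"),
  new StringIndexer().setInputCol("reason").setOutputCol("reason_idx"),
  new StringIndexer().setInputCol("day").setOutputCol("day_idx"),
  new StringIndexer().setInputCol("ageBand").setOutputCol("label"),
  new VectorAssembler()
    .setInputCols(Array("gender_idx", "reason_idx", "day_idx", "hour", "amount"))
    .setOutputCol("features"),
  new NaiveBayes()   // reads the "features" and "label" columns by default
))

// Fit on the training data only, then reuse the same fitted model to transform
// the test data, so every text value maps to the same number in both phases.
val model = pipeline.fit(training)
model.transform(training).select("features", "label", "prediction").show()
{code}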

> text type features column for classification
> 
>
> Key: SPARK-16290
> URL: https://issues.apache.org/jira/browse/SPARK-16290
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 1.6.2
>Reporter: mahendra singh
>  Labels: features
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> We should improve Spark ML and MLlib with respect to feature columns, i.e. allow 
> text values in the features as well. 
> Suppose we have 4 feature values: 
> id, dept_name, score, result 
> We can see that dept_name is a text column, so Spark should handle it internally, 
> i.e. convert the text column into a numerical one. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0

2016-06-29 Thread Jacky Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacky Li updated SPARK-16316:
-
Description: 
Version: spark 1.5.0
Use case: use the except API to subtract one DataFrame from another

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.

While the following statement works fine:
scala> dfa.except(dfb).rdd.count
res13: Long = 90

I guess the bug may be somewhere in DataFrame.count.


  was:
Version: spark 1.5.0
Use case: use the except API to subtract one DataFrame from another

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.

While the following statement works fine:
scala> dfa.except(dfb).rdd.count
res13: Long = 0

I guess the bug may be somewhere in DataFrame.count.



> dataframe except API returning wrong result in spark 1.5.0
> --
>
> Key: SPARK-16316
> URL: https://issues.apache.org/jira/browse/SPARK-16316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Jacky Li
>
> Version: spark 1.5.0
> Use case: use the except API to subtract one DataFrame from another
> scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
> dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
> dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> dfa.except(dfb).count
> res13: Long = 0
> It should return 90 instead of 0.
> While the following statement works fine:
> scala> dfa.except(dfb).rdd.count
> res13: Long = 90
> I guess the bug may be somewhere in DataFrame.count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0

2016-06-29 Thread Jacky Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacky Li updated SPARK-16316:
-
Description: 
Version: spark 1.5.0
Use case: use the except API to subtract one DataFrame from another

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0.

While the following statement works fine:
scala> dfa.except(dfb).rdd.count
res13: Long = 0

I guess the bug may be somewhere in DataFrame.count.


  was:
Version: spark 1.5.0
Use case: use the except API to subtract one DataFrame from another

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0



> dataframe except API returning wrong result in spark 1.5.0
> --
>
> Key: SPARK-16316
> URL: https://issues.apache.org/jira/browse/SPARK-16316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Jacky Li
>
> Version: spark 1.5.0
> Use case: use the except API to subtract one DataFrame from another
> scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
> dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
> dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]
> scala> dfa.except(dfb).count
> res13: Long = 0
> It should return 90 instead of 0.
> While the following statement works fine:
> scala> dfa.except(dfb).rdd.count
> res13: Long = 0
> I guess the bug may be somewhere in DataFrame.count.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16316) dataframe except API returning wrong result in spark 1.5.0

2016-06-29 Thread Jacky Li (JIRA)
Jacky Li created SPARK-16316:


 Summary: dataframe except API returning wrong result in spark 1.5.0
 Key: SPARK-16316
 URL: https://issues.apache.org/jira/browse/SPARK-16316
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Jacky Li


Version: spark 1.5.0
Use case: use the except API to subtract one DataFrame from another

scala> val dfa = sc.parallelize(1 to 100).map(x => (x, x)).toDF("i", "j")
dfa: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> val dfb = sc.parallelize(1 to 10).map(x => (x, x)).toDF("i", "j")
dfb: org.apache.spark.sql.DataFrame = [i: int, j: int]

scala> dfa.except(dfb).count
res13: Long = 0

It should return 90 instead of 0




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16315) Implement code generation for elt function

2016-06-29 Thread Peter Lee (JIRA)
Peter Lee created SPARK-16315:
-

 Summary: Implement code generation for elt function
 Key: SPARK-16315
 URL: https://issues.apache.org/jira/browse/SPARK-16315
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Peter Lee


This is a follow-up for SPARK-16276.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16287) Implement str_to_map SQL function

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356524#comment-15356524
 ] 

Apache Spark commented on SPARK-16287:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/13990

> Implement str_to_map SQL function
> -
>
> Key: SPARK-16287
> URL: https://issues.apache.org/jira/browse/SPARK-16287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16287) Implement str_to_map SQL function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16287:


Assignee: (was: Apache Spark)

> Implement str_to_map SQL function
> -
>
> Key: SPARK-16287
> URL: https://issues.apache.org/jira/browse/SPARK-16287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16287) Implement str_to_map SQL function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16287:


Assignee: Apache Spark

> Implement str_to_map SQL function
> -
>
> Key: SPARK-16287
> URL: https://issues.apache.org/jira/browse/SPARK-16287
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16311:


Assignee: Apache Spark

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.
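
For context, a minimal sketch of the explicit-refresh workaround referred to in (1), assuming Spark 2.0's SparkSession/Catalog API and a made-up table name; the Dataset.refresh() method in (3) is only proposed here and does not exist yet:

{code}
// Hypothetical table name; both forms ask Spark to re-list the underlying files
// so the next query does not hit a FileNotFoundException for a stale file.
spark.sql("REFRESH TABLE my_table")
spark.catalog.refreshTable("my_table")
{code}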



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16311:


Assignee: (was: Apache Spark)

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356516#comment-15356516
 ] 

Apache Spark commented on SPARK-16311:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/13989

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16314) Spark application got stuck when NM running executor is restarted

2016-06-29 Thread Yesha Vora (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356401#comment-15356401
 ] 

Yesha Vora commented on SPARK-16314:


Thanks [~jerryshao] for the analysis.

{code}
Looking through the log, I think we're running into RPC timeout and retry 
problems. In this scenario NM recovery is enabled:
1. We kill and restart the NM, which can hit a race condition where a container 
has been allocated and the executor is starting to connect to the external 
shuffle service; if the NM fails at that moment, the executor fails as well 
(it cannot connect to the external shuffle service).
2. Once the executor exits, the driver issues RPC requests asking the AM for the 
reason of the failure. In this situation the failed executors are in a zombie 
state, which means the driver still keeps their metadata and only cleans them up 
once the AM reports back the results. But with a failed NM, the AM cannot get 
the failed container state until the RPC times out (120s), and the timed-out RPC 
is then retried (waiting for another 120s timeout).
3. In the meantime, if more than 3 executors fail because of this, the AM and 
driver exit. If the NM is restarted at this point, it reports the failed 
containers to the AM and the AM sends RemoveExecutor to the driver; but the 
driver has already exited, so the message is never delivered and we again wait 
until the timeout (120s) and retry.
These cumulative timeouts block the application from exiting and delay the 
reattempt of the application, which is why we saw the application hang.
I think in this test we're running into a corner case. 
{code}

> Spark application got stuck when NM running executor is restarted
> -
>
> Key: SPARK-16314
> URL: https://issues.apache.org/jira/browse/SPARK-16314
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>
> The Spark application hangs if the NodeManager running an executor is stopped.
> * Start the LogQuery application.
> * The application starts 2 executors, each on a different node.
> * Restart one of the NodeManagers.
> The application stays stuck at 10% progress for 12 minutes. 
> Expected behavior: the application should either pass or fail. It should not 
> hang. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16314) Spark application got stuck when NM running executor is restarted

2016-06-29 Thread Yesha Vora (JIRA)
Yesha Vora created SPARK-16314:
--

 Summary: Spark application got stuck when NM running executor is 
restarted
 Key: SPARK-16314
 URL: https://issues.apache.org/jira/browse/SPARK-16314
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
Reporter: Yesha Vora


The Spark application hangs if the NodeManager running an executor is stopped.

* Start the LogQuery application.
* The application starts 2 executors, each on a different node.
* Restart one of the NodeManagers.

The application stays stuck at 10% progress for 12 minutes. 

Expected behavior: the application should either pass or fail. It should not hang. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356393#comment-15356393
 ] 

Apache Spark commented on SPARK-16101:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/13988

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source has quite a different structure from the JSON 
> data source, although they could be largely similar.
> It would be great if they had a similar structure so that some common 
> issues could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16101:


Assignee: Apache Spark

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> Currently, the CSV data source has quite a different structure from the JSON 
> data source, although they could be largely similar.
> It would be great if they had a similar structure so that some common 
> issues could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16101) Refactoring CSV data source to be consistent with JSON data source

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16101:


Assignee: (was: Apache Spark)

> Refactoring CSV data source to be consistent with JSON data source
> --
>
> Key: SPARK-16101
> URL: https://issues.apache.org/jira/browse/SPARK-16101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> Currently, the CSV data source has quite a different structure from the JSON 
> data source, although they could be largely similar.
> It would be great if they had a similar structure so that some common 
> issues could be resolved together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16313:


Assignee: Apache Spark  (was: Reynold Xin)

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16313:


Assignee: Reynold Xin  (was: Apache Spark)

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356364#comment-15356364
 ] 

Apache Spark commented on SPARK-16313:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13987

> Spark should not silently drop exceptions in file listing
> -
>
> Key: SPARK-16313
> URL: https://issues.apache.org/jira/browse/SPARK-16313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16313) Spark should not silently drop exceptions in file listing

2016-06-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16313:
---

 Summary: Spark should not silently drop exceptions in file listing
 Key: SPARK-16313
 URL: https://issues.apache.org/jira/browse/SPARK-16313
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16292) Failed to create spark client

2016-06-29 Thread Arcflash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356301#comment-15356301
 ] 

Arcflash commented on SPARK-16292:
--

Thanks, I checked my settings and it works fine.

> Failed to create spark client
> -
>
> Key: SPARK-16292
> URL: https://issues.apache.org/jira/browse/SPARK-16292
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: hadoop-2.6.0
> hive-2.1.0
>Reporter: Arcflash
>
> When I use Hive on Spark, I get this error:
> {noformat}
> Failed to execute spark task, with exception 
> 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create spark 
> client.)'
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask
> {noformat}
> My settings:
> {noformat}
> set hive.execution.engine=spark;
> set spark.home=/opt/spark1.6.0;
> set spark.master=192.168.3.111;
> set spark.eventLog.enabled=true;
> set spark.eventLog.dir=/tmp;
> set spark.executor.memory=512m; 
> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
> {noformat}
> Exceptions seen:
> {noformat}
> 16/06/30 16:17:02 [main]: INFO client.SparkClientImpl: Loading spark 
> defaults: file:/opt/hive2.1/conf/spark-defaults.conf
> 16/06/30 16:17:02 [main]: INFO client.SparkClientImpl: Running client driver 
> with argv: /opt/spark1.6.0/bin/spark-submit --properties-file 
> /tmp/spark-submit.7397226318023137500.properties --class 
> org.apache.hive.spark.client.RemoteDriver 
> /opt/hive2.1/lib/hive-exec-2.1.0.jar --remote-host master-0 --remote-port 
> 34055 --conf hive.spark.client.connect.timeout=1000 --conf 
> hive.spark.client.server.connect.timeout=9 --conf 
> hive.spark.client.channel.log.level=null --conf 
> hive.spark.client.rpc.max.size=52428800 --conf 
> hive.spark.client.rpc.threads=8 --conf hive.spark.client.secret.bits=256 
> --conf hive.spark.client.rpc.server.address=null
> 16/06/30 16:17:03 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
> Ignoring non-spark config property: 
> hive.spark.client.server.connect.timeout=9
> 16/06/30 16:17:03 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
> Ignoring non-spark config property: hive.spark.client.rpc.threads=8
> 16/06/30 16:17:03 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
> Ignoring non-spark config property: hive.spark.client.connect.timeout=1000
> 16/06/30 16:17:03 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
> Ignoring non-spark config property: hive.spark.client.secret.bits=256
> 16/06/30 16:17:03 [stderr-redir-1]: INFO client.SparkClientImpl: Warning: 
> Ignoring non-spark config property: hive.spark.client.rpc.max.size=52428800
> 16/06/30 16:17:03 [stderr-redir-1]: INFO client.SparkClientImpl: 16/06/30 
> 16:17:03 INFO client.RemoteDriver: Connecting to: master-0:34055
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl: Exception in 
> thread "main" java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.hive.spark.client.rpc.RpcConfiguration.(RpcConfiguration.java:45)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.hive.spark.client.RemoteDriver.(RemoteDriver.java:134)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> java.lang.reflect.Method.invoke(Method.java:497)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> 16/06/30 16:17:04 [stderr-redir-1]: INFO client.SparkClientImpl:at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 16/06/30 16:17:04 [main]: ERROR 

[jira] [Updated] (SPARK-16274) Implement xpath_boolean

2016-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16274:

Assignee: Peter Lee

> Implement xpath_boolean
> ---
>
> Key: SPARK-16274
> URL: https://issues.apache.org/jira/browse/SPARK-16274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16274) Implement xpath_boolean

2016-06-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16274.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13964
[https://github.com/apache/spark/pull/13964]

> Implement xpath_boolean
> ---
>
> Key: SPARK-16274
> URL: https://issues.apache.org/jira/browse/SPARK-16274
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16312) Docs for Kafka 0.10 consumer integration

2016-06-29 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-16312:
--

 Summary: Docs for Kafka 0.10 consumer integration
 Key: SPARK-16312
 URL: https://issues.apache.org/jira/browse/SPARK-16312
 Project: Spark
  Issue Type: Sub-task
Reporter: Cody Koeninger






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-29 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356277#comment-15356277
 ] 

Yanbo Liang commented on SPARK-16144:
-

Sure. 

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16311:

Target Version/s: 2.0.0

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Peter Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356275#comment-15356275
 ] 

Peter Lee commented on SPARK-16311:
---

I can work on this one ...


> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16311:

Description: 
When the underlying file changes, it can be very confusing to users when they 
see a FileNotFoundException. It would be great to do the following:

(1) Append a message to the FileNotFoundException saying that a workaround is to 
explicitly refresh the metadata.
(2) Make metadata refresh work on temporary tables/views.
(3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
Dataset.refresh() method.


  was:
When the underlying file changes, it can be very confusing to users when they 
see a FileNotFoundException. It would be great to do the following:

(1) Append a message to the FileNotFoundException saying that a workaround is to 
explicitly refresh the metadata.
(2) Make metadata refresh work on temporary tables/views.
(3) Make metadata refresh work on Datasets/DataFrames.


> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames, by introducing a 
> Dataset.refresh() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16311:

Description: 
When the underlying file changes, it can be very confusing to users when they 
see a FileNotFoundException. It would be great to do the following:

(1) Append a message to the FileNotFoundException saying that a workaround is to 
explicitly refresh the metadata.
(2) Make metadata refresh work on temporary tables/views.
(3) Make metadata refresh work on Datasets/DataFrames.

  was:
When the underlying file changes, it can be very confusing to users 


The refresh command currently only works on metastore-based relations. It 
should also work on temporary views/tables.


> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users when they 
> see a FileNotFoundException. It would be great to do the following:
> (1) Append a message to the FileNotFoundException saying that a workaround is to 
> explicitly refresh the metadata.
> (2) Make metadata refresh work on temporary tables/views.
> (3) Make metadata refresh work on Datasets/DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16311) "refresh" should work on temporary tables or views or Dataset/DataFrame

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16311:

Description: 
When the underlying file changes, it can be very confusing to users 


The refresh command currently only works on metastore-based relations. It 
should also work on temporary views/tables.

  was:
The refresh command currently only works on metastore-based relations. It 
should also work on temporary views/tables.



> "refresh" should work on temporary tables or views or Dataset/DataFrame
> ---
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users 
> The refresh command currently only works on metastore-based relations. It 
> should also work on temporary views/tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16311) Improve metadata refresh

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16311:

Summary: Improve metadata refresh  (was: "refresh" should work on temporary 
tables or views or Dataset/DataFrame)

> Improve metadata refresh
> 
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> When the underlying file changes, it can be very confusing to users 
> The refresh command currently only works on metastore-based relations. It 
> should also work on temporary views/tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16311) "refresh" should work on temporary tables or views or Dataset/DataFrame

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16311:

Summary: "refresh" should work on temporary tables or views or 
Dataset/DataFrame  (was: "refresh" should work on temporary tables or views)

> "refresh" should work on temporary tables or views or Dataset/DataFrame
> ---
>
> Key: SPARK-16311
> URL: https://issues.apache.org/jira/browse/SPARK-16311
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> The refresh command currently only works on metastore-based relations. It 
> should also work on temporary views/tables.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16114) Add network word count example

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15356255#comment-15356255
 ] 

Apache Spark commented on SPARK-16114:
--

User 'jjthomas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13957

> Add network word count example
> --
>
> Key: SPARK-16114
> URL: https://issues.apache.org/jira/browse/SPARK-16114
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: James Thomas
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16267) Replace deprecated `CREATE TEMPORARY TABLE` from testsuites

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16267.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Replace deprecated `CREATE TEMPORARY TABLE` from testsuites
> ---
>
> Key: SPARK-16267
> URL: https://issues.apache.org/jira/browse/SPARK-16267
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.0.0
>
>
> After SPARK-15674, `DDLStrategy` prints out the following deprecation 
> messages in the testsuites.
> {code}
> 12:10:53.284 WARN org.apache.spark.sql.execution.SparkStrategies$DDLStrategy: 
> CREATE TEMPORARY TABLE normal_orc_source USING... is deprecated, please use 
> CREATE TEMPORARY VIEW viewName USING... instead
> {code}
> - JDBCWriteSuite: 14
> - DDLSuite: 6
> - TableScanSuite: 6
> - ParquetSourceSuite: 5
> - OrcSourceSuite: 2
> - SQLQuerySuite: 2
> - HiveCommandSuite: 2
> - JsonSuite: 1
> - PrunedScanSuite: 1
> - FilteredScanSuite  1
> This PR replaces `CREATE TEMPORARY TABLE` with `CREATE TEMPORARY VIEW` in 
> order to remove the deprecation messages, except in `DDLSuite`, `SQLQuerySuite`, 
> and `HiveCommandSuite`.
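
For reference, a small sketch of the syntax change applied in the test suites (the data source, view name, and path below are made up):

{code}
// Deprecated form that triggers the warning above:
spark.sql("CREATE TEMPORARY TABLE my_source USING parquet OPTIONS (path '/tmp/some_data')")

// Replacement form used by this change:
spark.sql("CREATE TEMPORARY VIEW my_source USING parquet OPTIONS (path '/tmp/some_data')")
{code}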



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16134) optimizer rules for typed filter

2016-06-29 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16134.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13846
[https://github.com/apache/spark/pull/13846]

> optimizer rules for typed filter
> 
>
> Key: SPARK-16134
> URL: https://issues.apache.org/jira/browse/SPARK-16134
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16308) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355992#comment-15355992
 ] 

Felix Cheung commented on SPARK-16308:
--

could someone please close this bug (I can't)?

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16308
> URL: https://issues.apache.org/jira/browse/SPARK-16308
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16309) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355991#comment-15355991
 ] 

Felix Cheung commented on SPARK-16309:
--

could someone please close this bug (I can't)?

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16309
> URL: https://issues.apache.org/jira/browse/SPARK-16309
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16308) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung closed SPARK-16308.


> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16308
> URL: https://issues.apache.org/jira/browse/SPARK-16308
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16308) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-16308.
--
Resolution: Duplicate

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16308
> URL: https://issues.apache.org/jira/browse/SPARK-16308
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355985#comment-15355985
 ] 

Apache Spark commented on SPARK-16310:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/13984

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16310
> URL: https://issues.apache.org/jira/browse/SPARK-16310
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16309) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355988#comment-15355988
 ] 

Felix Cheung commented on SPARK-16309:
--

dup of https://issues.apache.org/jira/browse/SPARK-16310


> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16309
> URL: https://issues.apache.org/jira/browse/SPARK-16309
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16308) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355989#comment-15355989
 ] 

Felix Cheung commented on SPARK-16308:
--

dup of https://issues.apache.org/jira/browse/SPARK-16310


> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16308
> URL: https://issues.apache.org/jira/browse/SPARK-16308
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16310:


Assignee: Apache Spark

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16310
> URL: https://issues.apache.org/jira/browse/SPARK-16310
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16310:


Assignee: (was: Apache Spark)

> SparkR csv source should have the same default na.string as R
> -
>
> Key: SPARK-16310
> URL: https://issues.apache.org/jira/browse/SPARK-16310
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Felix Cheung
>Priority: Minor
>
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
> na.strings = "NA"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16021:


Assignee: Apache Spark

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to put this 
> behind a configuration flag and enable it only in tests.
> cc [~sameerag] [~hvanhovell]
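
Not Spark's actual memory-manager code, just a minimal sketch of the idea: poison freed memory with a sentinel value behind a flag that would only be enabled in tests (all names here are illustrative):

{code}
import java.util.Arrays

// Toy on-heap "allocator": pages are plain byte arrays. Names are illustrative,
// not Spark APIs.
object DebugAllocator {
  // Sentinel written over freed pages; a consumer that keeps reading a page
  // after freeing it sees this pattern instead of stale-but-plausible data.
  val FreedMarker: Byte = 0x5a

  // Would sit behind a test-only configuration flag because of the write cost.
  var debugFillEnabled: Boolean = true

  def allocate(size: Int): Array[Byte] = new Array[Byte](size)

  def free(page: Array[Byte]): Unit = {
    if (debugFillEnabled) {
      Arrays.fill(page, FreedMarker)  // make use-after-free immediately visible
    }
  }
}
{code}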



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16310) SparkR csv source should have the same default na.string as R

2016-06-29 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-16310:


 Summary: SparkR csv source should have the same default na.string 
as R
 Key: SPARK-16310
 URL: https://issues.apache.org/jira/browse/SPARK-16310
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.2
Reporter: Felix Cheung
Priority: Minor


https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16304) LinkageError should not crash Spark executor

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355949#comment-15355949
 ] 

Apache Spark commented on SPARK-16304:
--

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/13982

> LinkageError should not crash Spark executor
> 
>
> Key: SPARK-16304
> URL: https://issues.apache.org/jira/browse/SPARK-16304
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>
> If we have a linkage error in the user code, Spark executors get killed 
> immediately. This is not great for user experience.
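For context, LinkageError extends java.lang.Error, so scala.util.control.NonFatal does not match it and it propagates out of the usual task error handling. A minimal sketch of the intended behaviour (simplified; not the actual Executor code):

{code}
import scala.util.control.NonFatal

// Run a task and turn both ordinary exceptions and linkage errors into a
// reported failure instead of letting the error escape and kill the JVM.
def runTask(task: () => Unit): Either[Throwable, Unit] =
  try {
    Right(task())
  } catch {
    case e: LinkageError => Left(e) // e.g. NoClassDefFoundError from user code
    case NonFatal(e)     => Left(e)
  }
{code}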



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16228) "Percentile" needs explicit cast to double

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16228.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> "Percentile" needs explicit cast to double
> --
>
> Key: SPARK-16228
> URL: https://issues.apache.org/jira/browse/SPARK-16228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> {quote}
>  select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla
> {quote}
> Works.
> {quote}
>  select percentile(cast(id as bigint), 0.5 ) from temp.bla
> {quote}
> Throws
> {quote}
> Error in query: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, 
> decimal(38,18)). Possible choices: _FUNC_(bigint, array)  
> _FUNC_(bigint, double)  ; line 1 pos 7
> {quote}
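Until the decimal literal is coerced automatically, the explicit cast shown above is the workaround. A minimal sketch (assumes an active SparkSession named {{spark}} and the table/column names from this report):

{code}
// Resolves to the (bigint, double) overload of UDAFPercentile and works.
spark.sql("select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla").show()

// Fails before the fix: 0.5 is typed as decimal(38,18), for which the Hive UDAF
// has no matching method.
// spark.sql("select percentile(cast(id as bigint), 0.5) from temp.bla").show()
{code}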



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16021:


Assignee: Apache Spark

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>Assignee: Apache Spark
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to put it 
> behind a configuration flag and enable it only in tests.
> cc [~sameerag] [~hvanhovell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16304) LinkageError should not crash Spark executor

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16304:


Assignee: Apache Spark

> LinkageError should not crash Spark executor
> 
>
> Key: SPARK-16304
> URL: https://issues.apache.org/jira/browse/SPARK-16304
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> If we have a linkage error in the user code, Spark executors get killed 
> immediately. This is not great for user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16021) Zero out freed memory in test to help catch correctness bugs

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355951#comment-15355951
 ] 

Apache Spark commented on SPARK-16021:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/13983

> Zero out freed memory in test to help catch correctness bugs
> 
>
> Key: SPARK-16021
> URL: https://issues.apache.org/jira/browse/SPARK-16021
> Project: Spark
>  Issue Type: Improvement
>Reporter: Eric Liang
>
> In both on-heap and off-heap modes, it would be helpful to immediately zero 
> out (or otherwise fill with a sentinel value) memory when an object is 
> deallocated.
> Currently, in on-heap mode, freed memory can be accessed without visible 
> error if no other consumer has written to the same space. Similarly, off-heap 
> memory can be accessed without fault if the allocation library has not 
> released the pages back to the OS. Zeroing out freed memory would make these 
> errors immediately visible as a correctness problem.
> Since this would add some performance overhead, it would make sense to put it 
> behind a configuration flag and enable it only in tests.
> cc [~sameerag] [~hvanhovell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16306) Improve testing for DecisionTree variances

2016-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16306.
---
Resolution: Duplicate

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16306
> URL: https://issues.apache.org/jira/browse/SPARK-16306
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the returned variance equals 
> a variance that we can verify by hand on a toy dataset and tree.
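For instance, the hand-verified reference value on a toy dataset is just the population variance of the labels reaching a node. A minimal sketch of what such a check could compare against (toy numbers only, not from this ticket):

{code}
// Labels that fall into one node of a toy regression tree.
val labels = Seq(1.0, 2.0, 3.0, 4.0)
val mean = labels.sum / labels.size
// Population variance computed by hand: ((1-2.5)^2 + ... + (4-2.5)^2) / 4 = 1.25
val expectedVariance = labels.map(x => math.pow(x - mean, 2)).sum / labels.size
// A test would assert that the node's reported variance equals this value
// (within a small tolerance).
println(expectedVariance) // 1.25
{code}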



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16305) LinkageError should not crash Spark executor

2016-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16305.
---
Resolution: Duplicate

> LinkageError should not crash Spark executor
> 
>
> Key: SPARK-16305
> URL: https://issues.apache.org/jira/browse/SPARK-16305
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>
> If we have a linkage error in the user code, Spark executors get killed 
> immediately. This is not great for user experience.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16307:


Assignee: (was: Apache Spark)

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16307
> URL: https://issues.apache.org/jira/browse/SPARK-16307
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the returned variance equals 
> a variance that we can verify by hand on a toy dataset and tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355896#comment-15355896
 ] 

Apache Spark commented on SPARK-16307:
--

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/13981

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16307
> URL: https://issues.apache.org/jira/browse/SPARK-16307
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the returned variance equals 
> a variance that we can verify by hand on a toy dataset and tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16307:


Assignee: Apache Spark

> Improve testing for DecisionTree variances
> --
>
> Key: SPARK-16307
> URL: https://issues.apache.org/jira/browse/SPARK-16307
> Project: Spark
>  Issue Type: Test
>Reporter: Manoj Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> The current test assumes that Impurity.calculate() returns the variance 
> correctly. A better approach would be to test whether the returned variance equals 
> a variance that we can verify by hand on a toy dataset and tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16307) Improve testing for DecisionTree variances

2016-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-16307:
---

 Summary: Improve testing for DecisionTree variances
 Key: SPARK-16307
 URL: https://issues.apache.org/jira/browse/SPARK-16307
 Project: Spark
  Issue Type: Test
Reporter: Manoj Kumar
Priority: Minor


The current test assumes that Impurity.calculate() returns the variance 
correctly. A better approach would be to test whether the returned variance equals 
a variance that we can verify by hand on a toy dataset and tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16006) Attempting to write an empty DataFrame with no fields throws a non-intuitive exception

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16006.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Attempting to write an empty DataFrame with no fields throws a non-intuitive 
> exception
> ---
>
> Key: SPARK-16006
> URL: https://issues.apache.org/jira/browse/SPARK-16006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tathagata Das
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> Attempting to write an empty DataFrame via 
> {{sparkSession.emptyDataFrame.write.text("p")}} fails with the following 
> exception:
> {code}
> org.apache.spark.sql.AnalysisException: Cannot use all columns for partition 
> columns;
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:355)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:435)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:213)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:196)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:525)
>   ... 48 elided
> {code}
> This is because # fields == # partitioning columns == 0 at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:355).
> The resulting message is non-intuitive; a better error message would be "Cannot 
> write a dataset with no fields".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16306) Improve testing for DecisionTree variances

2016-06-29 Thread Manoj Kumar (JIRA)
Manoj Kumar created SPARK-16306:
---

 Summary: Improve testing for DecisionTree variances
 Key: SPARK-16306
 URL: https://issues.apache.org/jira/browse/SPARK-16306
 Project: Spark
  Issue Type: Test
Reporter: Manoj Kumar
Priority: Minor


The current test assumes that Impurity.calculate() returns the variance 
correctly. A better approach would be to test whether the returned variance equals 
a variance that we can verify by hand on a toy dataset and tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16305) LinkageError should not crash Spark executor

2016-06-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16305:
---

 Summary: LinkageError should not crash Spark executor
 Key: SPARK-16305
 URL: https://issues.apache.org/jira/browse/SPARK-16305
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin


If we have a linkage error in the user code, Spark executors get killed 
immediately. This is not great for user experience.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16304) LinkageError should not crash Spark executor

2016-06-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16304:
---

 Summary: LinkageError should not crash Spark executor
 Key: SPARK-16304
 URL: https://issues.apache.org/jira/browse/SPARK-16304
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Reynold Xin


If we have a linkage error in the user code, Spark executors get killed 
immediately. This is not great for user experience.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16260) PySpark ML Example Improvements and Cleanup

2016-06-29 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355810#comment-15355810
 ] 

Miao Wang commented on SPARK-16260:
---

[~bryanc] I can help with the QA. Will you create sub-tasks, or do you want to use 
this JIRA for all related PRs? Thanks!

> PySpark ML Example Improvements and Cleanup
> ---
>
> Key: SPARK-16260
> URL: https://issues.apache.org/jira/browse/SPARK-16260
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> This parent task is to track a few possible improvements and cleanup for 
> PySpark ML examples I noticed during 2.0 QA.  These include:
> * Parity with Scala ML examples
> * Ensure input format is documented in example
> * Ensure results of example are clear and demonstrate functionality
> * Cleanup unused imports
> * Fix minor issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16044) input_file_name() returns empty strings in data sources based on NewHadoopRDD.

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16044:

Fix Version/s: 1.6.3

> input_file_name() returns empty strings in data sources based on NewHadoopRDD.
> --
>
> Key: SPARK-16044
> URL: https://issues.apache.org/jira/browse/SPARK-16044
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 1.6.3, 2.0.0
>
>
> The issue is that the {{input_file_name()}} function does not return file paths 
> when data sources use {{NewHadoopRDD}}; it is currently only supported for 
> {{FileScanRDD}} and {{HadoopRDD}}.
> To be clear, this does not affect Spark's internal data sources, because 
> currently none of them use {{NewHadoopRDD}}.
> However, several external data sources do use {{NewHadoopRDD}}. For example,
>  
> spark-redshift - 
> [here|https://github.com/databricks/spark-redshift/blob/cba5eee1ab79ae8f0fa9e668373a54d2b5babf6b/src/main/scala/com/databricks/spark/redshift/RedshiftRelation.scala#L149]
> spark-xml - 
> [here|https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala#L39-L47]
> Currently, using this function shows the output below:
> {code}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
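For reference, a minimal usage sketch of the function in Scala (assumes an active SparkSession named {{spark}}; the path is hypothetical):

{code}
import org.apache.spark.sql.functions.input_file_name

val df = spark.read.text("examples/src/main/resources/people.txt")
// With FileScanRDD/HadoopRDD-based sources this column holds the source path;
// with NewHadoopRDD-based sources it came back as an empty string before the fix.
df.select(input_file_name()).show(3, truncate = false)
{code}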



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16303) Update SQL examples and programming guide

2016-06-29 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-16303:
--

 Summary: Update SQL examples and programming guide
 Key: SPARK-16303
 URL: https://issues.apache.org/jira/browse/SPARK-16303
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Examples
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


We need to update SQL examples code under the {{examples}} sub-project, and 
then replace hard-coded snippets in the SQL programming guide with snippets 
automatically extracted from actual source files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16302:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Lianhui Wang
>Priority: Minor
>
> The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
> defaultParallelism tasks, which causes it to run empty or very small tasks.
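A minimal repro sketch (assumes a SparkSession named {{spark}}, e.g. in spark-shell; per the report above, the partition count matches spark.default.parallelism on an unpatched build):

{code}
import spark.implicits._

val df = Seq.empty[(Int, Int)].toDF("key", "value")
// The local collection is empty, yet the scan is split into defaultParallelism
// partitions, so count() schedules that many tiny or empty tasks.
println(df.rdd.getNumPartitions)
println(df.count()) // 0
{code}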



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16198:


Assignee: (was: Apache Spark)

> Change the access level of the predict method in spark.ml.Predictor to public
> -
>
> Key: SPARK-16198
> URL: https://issues.apache.org/jira/browse/SPARK-16198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hussein Hazimeh
>Priority: Minor
>  Labels: latency, performance
>
> h1. Summary
> The transform method of predictors in spark.ml has a relatively high latency 
> when predicting single instances or small batches, which is mainly due to the 
> overhead introduced by DataFrame operations. For a text classification task 
> on the RCV1 dataset, changing the access level of the low-level "predict" 
> method from protected to public and using it to make predictions reduced the 
> latency of single predictions by a factor of three to four, and that of batches by 
> 50%. While the transform method is flexible and sufficient for general usage, 
> exposing the low-level predict method to the public API can benefit many 
> applications that require low-latency response.
> h1. Experiment
> I performed an experiment to measure the latency of single instance 
> predictions in Spark and some other popular ML toolkits. Specifically, I'm 
> looking at the time it takes to predict or classify a feature vector 
> residing in memory after the model is trained.
> For each toolkit in the table below, logistic regression was trained on the 
> Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
> stored in LIBSVM format along with binary labels. Then the wall-clock time 
> required to classify each document in a sample of 100,000 documents is 
> measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
> reported. 
> All toolkits were tested on a desktop machine with an i7-6700 processor and 
> 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 
> 80ns for Python and 20ns for Scala.
> h1. Results
> The table below shows the latency of predictions for single instances in 
> milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
> 2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
> (Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
> access level of the predict method from protected to public and used it to 
> perform the predictions instead of transform. 
> ||Toolkit||API||P50||P90||P99||Max||
> |Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
> |{color:blue}Spark 2 (Modified){color}|{color:blue}ML 
> (Scala){color}|0.0004|0.0031|0.0087|0.3979|
> |{color:blue}Spark (Modified){color}|{color:blue}ML 
> (Scala){color}|0.0013|0.0061|0.0632|0.4924|
> |Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
> |Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
> |LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
> |{color:red}Spark{color}|{color:red}ML 
> (Scala){color}|2.3713|2.9577|4.6521|511.2448|
> |{color:red}Spark 2{color}|{color:red}ML 
> (Scala){color}|8.4603|9.4352|13.2143|292.8733|
> |BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
> |BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|
> The results show that spark.mllib has the lowest latency among all other 
> toolkits and APIs, and this can be attributed to its low-level prediction 
> function that operates directly on the feature vector. However, spark.ml has 
> a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 
> 10ms for Spark 2.0.0. Profiling the transform method of logistic regression 
> in spark.ml showed that only 0.01% of the time is being spent in doing the 
> dot product and logit transformation, while the rest of the time is dominated 
> by the DataFrame operations (mostly the “withColumn” operation that appends 
> the predictions column(s) to the input DataFrame). The results of the 
> modified versions of spark.ml, which directly use the predict method, 
> validate this observation, as the latency is reduced by a factor of three to four.
> Since Spark splits batch predictions into a series of single-instance 
> predictions, reducing the latency of single predictions can lead to lower 
> latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) 
> using testing_features.map(x => model.predict(x)).collect() instead of 
> model.transform(testing_dataframe).select(“prediction”).collect(), and the 
> former had roughly 50% less latency for batches of size 1000, 10,000, and 
> 100,000.
> Although the experiment is constrained to logistic regression, other 
> predictors in the classification, regression, and clustering modules can 
> suffer from the same problem 
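A condensed sketch of the two code paths compared in this experiment (assumes a fitted {{LogisticRegressionModel}} and a build where {{predict}} has been made public, which is exactly the change proposed here):

{code}
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

// DataFrame path: flexible, but every call pays the DataFrame/withColumn overhead.
def predictViaTransform(model: LogisticRegressionModel, singleRowDF: DataFrame): Double =
  model.transform(singleRowDF).select("prediction").head.getDouble(0)

// Low-latency path from the modified builds: operate on the feature vector directly
// (compiles only if predict is public, per this proposal).
def predictDirect(model: LogisticRegressionModel, features: Vector): Double =
  model.predict(features)
{code}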

[jira] [Commented] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355687#comment-15355687
 ] 

Apache Spark commented on SPARK-16198:
--

User 'husseinhazimeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/13980

> Change the access level of the predict method in spark.ml.Predictor to public
> -
>
> Key: SPARK-16198
> URL: https://issues.apache.org/jira/browse/SPARK-16198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hussein Hazimeh
>Priority: Minor
>  Labels: latency, performance
>
> h1. Summary
> The transform method of predictors in spark.ml has a relatively high latency 
> when predicting single instances or small batches, which is mainly due to the 
> overhead introduced by DataFrame operations. For a text classification task 
> on the RCV1 dataset, changing the access level of the low-level "predict" 
> method from protected to public and using it to make predictions reduced the 
> latency of single predictions by a factor of three to four, and that of batches by 
> 50%. While the transform method is flexible and sufficient for general usage, 
> exposing the low-level predict method to the public API can benefit many 
> applications that require low-latency response.
> h1. Experiment
> I performed an experiment to measure the latency of single instance 
> predictions in Spark and some other popular ML toolkits. Specifically, I'm 
> looking at the time it takes to predict or classify a feature vector 
> residing in memory after the model is trained.
> For each toolkit in the table below, logistic regression was trained on the 
> Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
> stored in LIBSVM format along with binary labels. Then the wall-clock time 
> required to classify each document in a sample of 100,000 documents is 
> measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
> reported. 
> All toolkits were tested on a desktop machine with an i7-6700 processor and 
> 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 
> 80ns for Python and 20ns for Scala.
> h1. Results
> The table below shows the latency of predictions for single instances in 
> milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
> 2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
> (Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
> access level of the predict method from protected to public and used it to 
> perform the predictions instead of transform. 
> ||Toolkit||API||P50||P90||P99||Max||
> |Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
> |{color:blue}Spark 2 (Modified){color}|{color:blue}ML 
> (Scala){color}|0.0004|0.0031|0.0087|0.3979|
> |{color:blue}Spark (Modified){color}|{color:blue}ML 
> (Scala){color}|0.0013|0.0061|0.0632|0.4924|
> |Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
> |Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
> |LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
> |{color:red}Spark{color}|{color:red}ML 
> (Scala){color}|2.3713|2.9577|4.6521|511.2448|
> |{color:red}Spark 2{color}|{color:red}ML 
> (Scala){color}|8.4603|9.4352|13.2143|292.8733|
> |BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
> |BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|
> The results show that spark.mllib has the lowest latency among all other 
> toolkits and APIs, and this can be attributed to its low-level prediction 
> function that operates directly on the feature vector. However, spark.ml has 
> a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 
> 10ms for Spark 2.0.0. Profiling the transform method of logistic regression 
> in spark.ml showed that only 0.01% of the time is being spent in doing the 
> dot product and logit transformation, while the rest of the time is dominated 
> by the DataFrame operations (mostly the “withColumn” operation that appends 
> the predictions column(s) to the input DataFrame). The results of the 
> modified versions of spark.ml, which directly use the predict method, 
> validate this observation, as the latency is reduced by a factor of three to four.
> Since Spark splits batch predictions into a series of single-instance 
> predictions, reducing the latency of single predictions can lead to lower 
> latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) 
> using testing_features.map(x => model.predict(x)).collect() instead of 
> model.transform(testing_dataframe).select(“prediction”).collect(), and the 
> former had roughly 50% less latency for batches of size 1000, 10,000, and 
> 100,000.
> Although the experiment is constrained to logistic regression, other 
> 

[jira] [Assigned] (SPARK-16198) Change the access level of the predict method in spark.ml.Predictor to public

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16198:


Assignee: Apache Spark

> Change the access level of the predict method in spark.ml.Predictor to public
> -
>
> Key: SPARK-16198
> URL: https://issues.apache.org/jira/browse/SPARK-16198
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Hussein Hazimeh
>Assignee: Apache Spark
>Priority: Minor
>  Labels: latency, performance
>
> h1. Summary
> The transform method of predictors in spark.ml has a relatively high latency 
> when predicting single instances or small batches, which is mainly due to the 
> overhead introduced by DataFrame operations. For a text classification task 
> on the RCV1 dataset, changing the access level of the low-level "predict" 
> method from protected to public and using it to make predictions reduced the 
> latency of single predictions by a factor of three to four, and that of batches by 
> 50%. While the transform method is flexible and sufficient for general usage, 
> exposing the low-level predict method to the public API can benefit many 
> applications that require low-latency response.
> h1. Experiment
> I performed an experiment to measure the latency of single instance 
> predictions in Spark and some other popular ML toolkits. Specifically, I'm 
> looking at the time it takes to predict or classify a feature vector 
> residing in memory after the model is trained.
> For each toolkit in the table below, logistic regression was trained on the 
> Reuters RCV1 dataset which contains 697,641 documents and 47,236 features 
> stored in LIBSVM format along with binary labels. Then the wall-clock time 
> required to classify each document in a sample of 100,000 documents is 
> measured, and the 50th, 90th, and 99th percentiles and the maximum time are 
> reported. 
> All toolkits were tested on a desktop machine with an i7-6700 processor and 
> 16 GB memory, running Ubuntu 14.04 and OpenBLAS. The wall clock resolution is 
> 80ns for Python and 20ns for Scala.
> h1. Results
> The table below shows the latency of predictions for single instances in 
> milliseconds, sorted by P90. Spark and Spark 2 refer to versions 1.6.1 and 
> 2.0.0-SNAPSHOT (on master), respectively. In {color:blue}Spark 
> (Modified){color} and {color:blue}Spark 2 (Modified){color},  I changed the 
> access level of the predict method from protected to public and used it to 
> perform the predictions instead of transform. 
> ||Toolkit||API||P50||P90||P99||Max||
> |Spark|MLLIB (Scala)|0.0002|0.0015|0.0028|0.0685|
> |{color:blue}Spark 2 (Modified){color}|{color:blue}ML 
> (Scala){color}|0.0004|0.0031|0.0087|0.3979|
> |{color:blue}Spark (Modified){color}|{color:blue}ML 
> (Scala){color}|0.0013|0.0061|0.0632|0.4924|
> |Spark|MLLIB (Python)|0.0065|0.0075|0.0212|0.1579|
> |Scikit-Learn|Python|0.0341|0.0460|0.0849|0.2285|
> |LIBLINEAR|Python|0.0669|0.1484|0.2623|1.7322|
> |{color:red}Spark{color}|{color:red}ML 
> (Scala){color}|2.3713|2.9577|4.6521|511.2448|
> |{color:red}Spark 2{color}|{color:red}ML 
> (Scala){color}|8.4603|9.4352|13.2143|292.8733|
> |BIDMach (CPU)|Scala|5.4061|49.1362|102.2563|12040.6773|
> |BIDMach (GPU)|Scala|471.3460|477.8214|485.9805|807.4782|
> The results show that spark.mllib has the lowest latency among all other 
> toolkits and APIs, and this can be attributed to its low-level prediction 
> function that operates directly on the feature vector. However, spark.ml has 
> a relatively high latency which is in the order of 3ms for Spark 1.6.1 and 
> 10ms for Spark 2.0.0. Profiling the transform method of logistic regression 
> in spark.ml showed that only 0.01% of the time is being spent in doing the 
> dot product and logit transformation, while the rest of the time is dominated 
> by the DataFrame operations (mostly the “withColumn” operation that appends 
> the predictions column(s) to the input DataFrame). The results of the 
> modified versions of spark.ml, which directly use the predict method, 
> validate this observation, as the latency is reduced by a factor of three to four.
> Since Spark splits batch predictions into a series of single-instance 
> predictions, reducing the latency of single predictions can lead to lower 
> latencies in batch predictions. I tried batch predictions in spark.ml (1.6.1) 
> using testing_features.map(x => model.predict(x)).collect() instead of 
> model.transform(testing_dataframe).select(“prediction”).collect(), and the 
> former had roughly 50% less latency for batches of size 1000, 10,000, and 
> 100,000.
> Although the experiment is constrained to logistic regression, other 
> predictors in the classification, regression, and clustering modules can 
> 

[jira] [Assigned] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16302:


Assignee: (was: Apache Spark)

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
> defaultParallelism tasks, which causes it to run empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16302:


Assignee: Apache Spark

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>Assignee: Apache Spark
>
> The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
> defaultParallelism tasks, which causes it to run empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355662#comment-15355662
 ] 

Apache Spark commented on SPARK-16302:
--

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/13979

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
> defaultParallelism tasks, which causes it to run empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-29 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355648#comment-15355648
 ] 

Xin Ren edited comment on SPARK-16144 at 6/29/16 6:57 PM:
--

Sure, thanks Xiangrui :)


was (Author: iamshrek):
Sure, thank Xiangrui :)

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-29 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355648#comment-15355648
 ] 

Xin Ren commented on SPARK-16144:
-

Sure, thank Xiangrui :)

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16302) Set the right number of partitions for reading data from a local collection.

2016-06-29 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16302:
-
Summary: Set the right number of partitions for reading data from a local 
collection.  (was: Set the default number of partitions for reading data from a 
local collection.)

> Set the right number of partitions for reading data from a local collection.
> 
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
> defaultParallelism tasks, which causes it to run empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16302) Set the default number of partitions for reading data from a local collection.

2016-06-29 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-16302:
-
Summary: Set the default number of partitions for reading data from a local 
collection.  (was: LocalTableScanExec always use defaultParallelism tasks even 
though it is very small seq.)

> Set the default number of partitions for reading data from a local collection.
> --
>
> Key: SPARK-16302
> URL: https://issues.apache.org/jira/browse/SPARK-16302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Lianhui Wang
>
> The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
> defaultParallelism tasks, which causes it to run empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16256) Add Structured Streaming Programming Guide

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355629#comment-15355629
 ] 

Apache Spark commented on SPARK-16256:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13978

> Add Structured Streaming Programming Guide
> --
>
> Key: SPARK-16256
> URL: https://issues.apache.org/jira/browse/SPARK-16256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16302) LocalTableScanExec always use defaultParallelism tasks even though it is very small seq.

2016-06-29 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-16302:


 Summary: LocalTableScanExec always use defaultParallelism tasks 
even though it is very small seq.
 Key: SPARK-16302
 URL: https://issues.apache.org/jira/browse/SPARK-16302
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Lianhui Wang


The query {{val df = Seq[(Int, Int)]().toDF("key", "value").count}} always uses 
defaultParallelism tasks, which causes it to run empty or very small tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance

2016-06-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14480.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.1.0

> Remove meaningless StringIteratorReader for CSV data source for better 
> performance
> --
>
> Key: SPARK-14480
> URL: https://issues.apache.org/jira/browse/SPARK-14480
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 2.1.0
>
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line 
> by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it 
> was made this way for better performance. However, there appear to be two 
> problems.
> Firstly, it was actually not faster than processing line by line with an 
> {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} into a {{Reader}}.
> Secondly, this added complexity, because extra logic is needed to let every line 
> be read byte by byte. That made it pretty difficult to track down parsing issues 
> (e.g. SPARK-14103). Actually, almost all of the code in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem 
> are below:
> h4. Results
> - Original codes with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New codes with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more details,
> h4. Method
> - TCP-H lineitem table is being tested.
> - Only the first 100 rows are collected.
> - End-to-end tests and parsing time tests are performed 10 times and averages 
> are calculated for each.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 
> ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) 
> - Size : 724.66 MB
> h4.  Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: "+(System.nanoTime-s)/1e6+"ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355601#comment-15355601
 ] 

Xiangrui Meng commented on SPARK-16144:
---

[~iamshrek] Thanks for helping! I asked Yanbo to take a final pass since he is 
awake at this time :)

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-29 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355599#comment-15355599
 ] 

Xiangrui Meng commented on SPARK-16144:
---

[~yanboliang] Do you have time to make a final pass over the ML docs in SparkR? 
For this ticket, we can link write.ml and predict to `spark.glm`, 
`spark.kmeans`, etc and write one sentence for each generic method (rather than 
leaving the raw `write.ml` function name).

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16144:
--
Assignee: Yanbo Liang  (was: Xin Ren)

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16140) Group k-means method in generated doc

2016-06-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16140.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Group k-means method in generated doc
> -
>
> Key: SPARK-16140
> URL: https://issues.apache.org/jira/browse/SPARK-16140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.1, 2.1.0
>
>
> Follow SPARK-16107 and group the doc of spark.kmeans, predict(KM), 
> summary(KM), read/write.ml(KM) under Rd spark.kmeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16301) Analyzer rule for resolving using joins should respect case sensitivity setting

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355596#comment-15355596
 ] 

Apache Spark commented on SPARK-16301:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/13977

> Analyzer rule for resolving using joins should respect case sensitivity 
> setting
> ---
>
> Key: SPARK-16301
> URL: https://issues.apache.org/jira/browse/SPARK-16301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> Quick repro: Passes on Spark 1.6.x, but fails on 2.0
> {code}
> case class MyColumn(userId: Int, field: String)
> val ds = Seq(MyColumn(1, "a")).toDF
> ds.join(ds, Seq("userid"))
> {code}
> {code}
> stacktrace:
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:313)
>   at scala.None$.get(Option.scala:311)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$88.apply(Analyzer.scala:1844)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$88.apply(Analyzer.scala:1844)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16288) Implement inline table generating function

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355560#comment-15355560
 ] 

Apache Spark commented on SPARK-16288:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13976

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16285) Implement sentences SQL function

2016-06-29 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355563#comment-15355563
 ] 

Dongjoon Hyun commented on SPARK-16285:
---

I'll work on this.

> Implement sentences SQL function
> 
>
> Key: SPARK-16285
> URL: https://issues.apache.org/jira/browse/SPARK-16285
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16288) Implement inline table generating function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16288:


Assignee: (was: Apache Spark)

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16288) Implement inline table generating function

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16288:


Assignee: Apache Spark

> Implement inline table generating function
> --
>
> Key: SPARK-16288
> URL: https://issues.apache.org/jira/browse/SPARK-16288
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16300) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16300.
---
Resolution: Duplicate

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16300
> URL: https://issues.apache.org/jira/browse/SPARK-16300
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Running SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor, and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, the 
> cleanup procedure of R deletes the temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed as well. All future R workers forked from 
> the daemon are also affected if they use tempdir() or tempfile() to get 
> temporary files, because they will fail to create temporary files under the 
> already-deleted session temporary directory.
> So in order for the daemon mode to work, this problem must be circumvented. 
> In the current daemon.R, R workers exit directly, skipping the cleanup 
> procedure of R, so that the shared temporary directory won't be deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when there is any execution error in an 
> R worker, R's error handling still ends up running the cleanup procedure. So 
> try() should be used in daemon.R to catch any error in the R workers, so 
> that they exit directly even when an error occurs.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16299:


Assignee: (was: Apache Spark)

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Running SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor, and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, the 
> cleanup procedure of R deletes the temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed as well. All future R workers forked from 
> the daemon are also affected if they use tempdir() or tempfile() to get 
> temporary files, because they will fail to create temporary files under the 
> already-deleted session temporary directory.
> So in order for the daemon mode to work, this problem must be circumvented. 
> In the current daemon.R, R workers exit directly, skipping the cleanup 
> procedure of R, so that the shared temporary directory won't be deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when there is any execution error in an 
> R worker, R's error handling still ends up running the cleanup procedure. So 
> try() should be used in daemon.R to catch any error in the R workers, so 
> that they exit directly even when an error occurs.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16299:


Assignee: Apache Spark

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> Running SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor, and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, the 
> cleanup procedure of R deletes the temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed as well. All future R workers forked from 
> the daemon are also affected if they use tempdir() or tempfile() to get 
> temporary files, because they will fail to create temporary files under the 
> already-deleted session temporary directory.
> So in order for the daemon mode to work, this problem must be circumvented. 
> In the current daemon.R, R workers exit directly, skipping the cleanup 
> procedure of R, so that the shared temporary directory won't be deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when there is any execution error in an 
> R worker, R's error handling still ends up running the cleanup procedure. So 
> try() should be used in daemon.R to catch any error in the R workers, so 
> that they exit directly even when an error occurs.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16299:


Assignee: (was: Apache Spark)

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Running SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor, and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, the 
> cleanup procedure of R deletes the temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed as well. All future R workers forked from 
> the daemon are also affected if they use tempdir() or tempfile() to get 
> temporary files, because they will fail to create temporary files under the 
> already-deleted session temporary directory.
> So in order for the daemon mode to work, this problem must be circumvented. 
> In the current daemon.R, R workers exit directly, skipping the cleanup 
> procedure of R, so that the shared temporary directory won't be deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when there is any execution error in an 
> R worker, R's error handling still ends up running the cleanup procedure. So 
> try() should be used in daemon.R to catch any error in the R workers, so 
> that they exit directly even when an error occurs.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16299:


Assignee: Apache Spark

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> Running SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor, and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, the 
> cleanup procedure of R deletes the temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed as well. All future R workers forked from 
> the daemon are also affected if they use tempdir() or tempfile() to get 
> temporary files, because they will fail to create temporary files under the 
> already-deleted session temporary directory.
> So in order for the daemon mode to work, this problem must be circumvented. 
> In the current daemon.R, R workers exit directly, skipping the cleanup 
> procedure of R, so that the shared temporary directory won't be deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when there is any execution error in an 
> R worker, R's error handling still ends up running the cleanup procedure. So 
> try() should be used in daemon.R to catch any error in the R workers, so 
> that they exit directly even when an error occurs.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-29 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355456#comment-15355456
 ] 

Bryan Cutler edited comment on SPARK-16247 at 6/29/16 3:53 PM:
---

I think you need to specify the {{labelCol}} in 
{{MulticlassClassificationEvaluator}}, otherwise it will default to "label", 
which contains the original string labels. So like this:

{noformat}
cvEvaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", 
metricName="precision")
{noformat}


was (Author: bryanc):
I think you need to specify the {labelCol} in 
{MulticlassClassificationEvaluator}, otherwise it will default to "label", 
which is the original string labels.  So like this:

{noformat}
cvEvaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", 
metricName="precision")
{noformat}

> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with DataFrames and a pipeline to train and predict 
> results. It works fine when run on its own.
> However, I hit an issue when using the pipeline with CrossValidator. I expect 
> CrossValidator to use "indexedLabel" and "indexedMsg" as the label and 
> feature columns. Those fields are built by StringIndexer and VectorIndexer, 
> so they should exist after the pipeline is executed.
> When I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, 
> _fit function, and line 239, est.fit), I found that it does not execute the 
> pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct or not?
> Thanks.
> Here is code snippet
> {noformat}
> // # Indexing
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", 
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> // # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", 
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, 
> classification_model])
> // # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 
> 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
> evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355462#comment-15355462
 ] 

Apache Spark commented on SPARK-16299:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/13975

> Capture errors from R workers in daemon.R to avoid deletion of R session 
> temporary directory
> 
>
> Key: SPARK-16299
> URL: https://issues.apache.org/jira/browse/SPARK-16299
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.2
>Reporter: Sun Rui
>
> Running SparkR unit tests intermittently fails with the following error:
> Failed 
> -
> 1. Error: pipeRDD() on RDDs (@test_rdd.R#428) 
> --
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
> (TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
> with
>  [1] 1
> [1] 1
> [1] 2
> [1] 2
> [1] 3
> [1] 3
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> [1] 2
> ignoring SIGPIPE signal
> Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> 
> writeBin
> Execution halted
> cannot open the connection
> Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
> In addition: Warning message:
> In file(con, "w") :
>   cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or 
> directory
> Execution halted
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> This is related to daemon R worker mode. By default, SparkR launches an R 
> daemon worker per executor, and forks R workers from the daemon when 
> necessary.
> The problem with forking R workers is that all forked R processes share a 
> temporary directory, as documented at 
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
> When any forked R worker exits, either normally or because of an error, the 
> cleanup procedure of R deletes the temporary directory. This affects the 
> still-running forked R workers, because any temporary files they created 
> under that directory are removed as well. All future R workers forked from 
> the daemon are also affected if they use tempdir() or tempfile() to get 
> temporary files, because they will fail to create temporary files under the 
> already-deleted session temporary directory.
> So in order for the daemon mode to work, this problem must be circumvented. 
> In the current daemon.R, R workers exit directly, skipping the cleanup 
> procedure of R, so that the shared temporary directory won't be deleted.
> {code}
>   source(script)
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}
> However, there is a bug in daemon.R: when there is any execution error in an 
> R worker, R's error handling still ends up running the cleanup procedure. So 
> try() should be used in daemon.R to catch any error in the R workers, so 
> that they exit directly even when an error occurs.
> {code}
>   try(source(script))
>   # Set SIGUSR1 so that child can exit
>   tools::pskill(Sys.getpid(), tools::SIGUSR1)
>   parallel:::mcexit(0L)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-29 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355456#comment-15355456
 ] 

Bryan Cutler commented on SPARK-16247:
--

I think you need to specify the {labelCol} in 
{MulticlassClassificationEvaluator}, otherwise it will default to "label", 
which contains the original string labels. So like this:

{noformat}
cvEvaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", 
metricName="precision")
{noformat}

> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with DataFrames and a pipeline to train and predict 
> results. It works fine when run on its own.
> However, I hit an issue when using the pipeline with CrossValidator. I expect 
> CrossValidator to use "indexedLabel" and "indexedMsg" as the label and 
> feature columns. Those fields are built by StringIndexer and VectorIndexer, 
> so they should exist after the pipeline is executed.
> When I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, 
> _fit function, and line 239, est.fit), I found that it does not execute the 
> pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct or not?
> Thanks.
> Here is code snippet
> {noformat}
> // # Indexing
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", 
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> // # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", 
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, 
> classification_model])
> // # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 
> 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
> evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16300) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Sun Rui (JIRA)
Sun Rui created SPARK-16300:
---

 Summary: Capture errors from R workers in daemon.R to avoid 
deletion of R session temporary directory
 Key: SPARK-16300
 URL: https://issues.apache.org/jira/browse/SPARK-16300
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.2
Reporter: Sun Rui


Running SparkR unit tests intermittently fails with the following error:

Failed -
1. Error: pipeRDD() on RDDs (@test_rdd.R#428) --
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
(TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
with
 [1] 1
[1] 1
[1] 2
[1] 2
[1] 3
[1] 3
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
ignoring SIGPIPE signal
Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> writeBin
Execution halted
cannot open the connection
Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
  cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or directory
Execution halted
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

This is related to daemon R worker mode. By default, SparkR launches an R 
daemon worker per executor, and forks R workers from the daemon when necessary.

The problem with forking R workers is that all forked R processes share a 
temporary directory, as documented at 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
When any forked R worker exits, either normally or because of an error, the 
cleanup procedure of R deletes the temporary directory. This affects the 
still-running forked R workers, because any temporary files they created under 
that directory are removed as well. All future R workers forked from the daemon 
are also affected if they use tempdir() or tempfile() to get temporary files, 
because they will fail to create temporary files under the already-deleted 
session temporary directory.

So in order for the daemon mode to work, this problem must be circumvented. In 
the current daemon.R, R workers exit directly, skipping the cleanup procedure of 
R, so that the shared temporary directory won't be deleted.
{code}
  source(script)
  # Set SIGUSR1 so that child can exit
  tools::pskill(Sys.getpid(), tools::SIGUSR1)
  parallel:::mcexit(0L)
{code}

However, there is a bug in daemon.R: when there is any execution error in an R 
worker, R's error handling still ends up running the cleanup procedure. So 
try() should be used in daemon.R to catch any error in the R workers, so that 
they exit directly even when an error occurs.
{code}
  try(source(script))
  # Set SIGUSR1 so that child can exit
  tools::pskill(Sys.getpid(), tools::SIGUSR1)
  parallel:::mcexit(0L)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16299) Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory

2016-06-29 Thread Sun Rui (JIRA)
Sun Rui created SPARK-16299:
---

 Summary: Capture errors from R workers in daemon.R to avoid 
deletion of R session temporary directory
 Key: SPARK-16299
 URL: https://issues.apache.org/jira/browse/SPARK-16299
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.2
Reporter: Sun Rui


Running SparkR unit tests intermittently fails with the following error:

Failed -
1. Error: pipeRDD() on RDDs (@test_rdd.R#428) --
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 792.0 failed 1 times, most recent failure: Lost task 0.0 in stage 792.0 
(TID 1493, localhost): org.apache.spark.SparkException: R computation failed 
with
 [1] 1
[1] 1
[1] 2
[1] 2
[1] 3
[1] 3
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
[1] 2
ignoring SIGPIPE signal
Calls: source ...  -> lapply -> lapply -> FUN -> writeRaw -> writeBin
Execution halted
cannot open the connection
Calls: source ... computeFunc -> FUN -> system2 -> writeLines -> file
In addition: Warning message:
In file(con, "w") :
  cannot open file '/tmp/Rtmp0Gr1aU/file2de3efc94b3': No such file or directory
Execution halted
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

This is related to daemon R worker mode. By default, SparkR launches an R 
daemon worker per executor, and forks R workers from the daemon when necessary.

The problem with forking R workers is that all forked R processes share a 
temporary directory, as documented at 
https://stat.ethz.ch/R-manual/R-devel/library/base/html/tempfile.html.
When any forked R worker exits, either normally or because of an error, the 
cleanup procedure of R deletes the temporary directory. This affects the 
still-running forked R workers, because any temporary files they created under 
that directory are removed as well. All future R workers forked from the daemon 
are also affected if they use tempdir() or tempfile() to get temporary files, 
because they will fail to create temporary files under the already-deleted 
session temporary directory.

So in order for the daemon mode to work, this problem must be circumvented. In 
the current daemon.R, R workers exit directly, skipping the cleanup procedure of 
R, so that the shared temporary directory won't be deleted.
{code}
  source(script)
  # Set SIGUSR1 so that child can exit
  tools::pskill(Sys.getpid(), tools::SIGUSR1)
  parallel:::mcexit(0L)
{code}

However, there is a bug in daemon.R: when there is any execution error in an R 
worker, R's error handling still ends up running the cleanup procedure. So 
try() should be used in daemon.R to catch any error in the R workers, so that 
they exit directly even when an error occurs.
{code}
  try(source(script))
  # Set SIGUSR1 so that child can exit
  tools::pskill(Sys.getpid(), tools::SIGUSR1)
  parallel:::mcexit(0L)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16297) Mapping Boolean and string to BIT and NVARCHAR(MAX) for SQL Server jdbc dialect

2016-06-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16297:
--
Issue Type: Bug  (was: Improvement)

> Mapping Boolean and string  to BIT and NVARCHAR(MAX) for SQL Server jdbc 
> dialect
> 
>
> Key: SPARK-16297
> URL: https://issues.apache.org/jira/browse/SPARK-16297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Oussama Mekni
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Tested with SQLServer 2012 and SQLServer Express:
> - Fix mapping of StringType to NVARCHAR(MAX)
> - Fix mapping of BooleanType to BIT
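
A minimal sketch of such a mapping, written against Spark's public {{JdbcDialect}} 
API (the object name is illustrative and this is not the actual patch):

{code}
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

// Illustrative dialect: map StringType to NVARCHAR(MAX) and BooleanType to BIT
// when writing DataFrames to SQL Server over JDBC.
object SqlServerDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:sqlserver")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("NVARCHAR(MAX)", Types.NVARCHAR))
    case BooleanType => Some(JdbcType("BIT", Types.BIT))
    case _           => None
  }
}

// Register the dialect before writing via the JDBC data source:
// JdbcDialects.registerDialect(SqlServerDialectSketch)
{code}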



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16230) Executors self-killing after being assigned tasks while still in init

2016-06-29 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355339#comment-15355339
 ] 

Tejas Patil commented on SPARK-16230:
-

I am using 1.6.1. I will try it out with the fix for SPARK-13112.

> Executors self-killing after being assigned tasks while still in init
> -
>
> Key: SPARK-16230
> URL: https://issues.apache.org/jira/browse/SPARK-16230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Minor
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR:   
> [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61]
>  sends request to register itself to the driver.
> * DRIVER: Registers executor and 
> [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179]
> * EXECUTOR:  ExecutorBackend receives ACK and [starts creating an 
> Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81]
> * DRIVER:  Tries to launch a task as it knows there is a new executor. Sends 
> a 
> [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268]
>  to this new executor.
> * EXECUTOR:  The Executor is not yet init'ed (one reason I have seen is that 
> it was still trying to register with the local external shuffle service). 
> Meanwhile, it receives a `LaunchTask` and [kills 
> itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90]
>  because the Executor is not init'ed.
> The driver assumes that the Executor is ready to accept tasks as soon as it 
> is registered, but that's not true.
> How this affects jobs / clusters:
> * We waste time and resources on these executors, but they don't do any 
> meaningful computation.
> * The driver thinks that the executor has started running the task, but since 
> the Executor has killed itself, it does not tell the driver (BTW: this is 
> another issue which I think could be fixed separately). The driver waits for 
> 10 minutes and then declares the executor dead. This adds to the latency of 
> the job. Plus, failure attempts also get bumped up for the tasks even though 
> the tasks were never started. For unlucky tasks, this might cause job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


