[jira] [Closed] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-15874.
---
Resolution: Not A Problem

> HBase rowkey optimization support for Hbase-Storage-handler
> ---
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause a full table scan (each table region becomes a scan split that does a
> full region scan).
> In fact, it is easy to implement the following optimizations:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use the HBase rowkey `Get`/multi-`Get` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit. For example, given a very small
> table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join: loop over each record of small_tab1,
> extract small_tab1.key1 as hbase_tab1's rowkey, and issue an HBase Get; the join
> then executes efficiently.
> The scenario described above is very common: many business systems have several
> tables keyed by a major key such as userID, and they often store them in HBase.
> People regularly need to run analysis over these tables with SQL, and such
> queries would be well optimized if the SQL execution plan had good support for
> HBase rowkeys.
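
For illustration, a minimal sketch (assuming the HBase 1.x client API; the table
name and row keys are made up) of the client calls these predicates could map to:

{code}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("tab1"))

// WHERE rowkey_col = 'abc'  ->  a single point lookup instead of a full scan
val single = table.get(new Get(Bytes.toBytes("abc")))

// WHERE rowkey_col = 'abc' OR rowkey_col = 'abd'  ->  one batched multi-get
val batch = table.get(java.util.Arrays.asList(new Get(Bytes.toBytes("abc")),
                                              new Get(Bytes.toBytes("abd"))))

// WHERE rowkey_col LIKE 'abc%'  ->  a bounded prefix scan over the row key
val scan = new Scan(Bytes.toBytes("abc")).setFilter(new PrefixFilter(Bytes.toBytes("abc")))
val scanner = table.getScanner(scan)
{code}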






[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

2016-06-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326223#comment-15326223
 ] 

Reynold Xin commented on SPARK-15874:
-

If you want to get fancy you can play with the experimental strategies in 
SQLContext. Take a look at that.

I'm going to mark this ticket as won't fix for now. Thanks!
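
For reference, a minimal sketch of that hook (the strategy name and matching
logic are hypothetical placeholders, not an existing connector):

{code}
import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// A do-nothing strategy skeleton; a real one would match Filter-over-HBase-relation
// plans and emit a physical operator that issues Get/Scan calls.
object HBaseRowkeyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case _ => Nil   // fall through to the built-in strategies
  }
}

val spark = SparkSession.builder().appName("demo").getOrCreate()
spark.experimental.extraStrategies = Seq(HBaseRowkeyStrategy)
{code}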


> HBase rowkey optimization support for Hbase-Storage-handler
> ---
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause a full table scan (each table region becomes a scan split that does a
> full region scan).
> In fact, it is easy to implement the following optimizations:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use the HBase rowkey `Get`/multi-`Get` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit. For example, given a very small
> table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join: loop over each record of small_tab1,
> extract small_tab1.key1 as hbase_tab1's rowkey, and issue an HBase Get; the join
> then executes efficiently.
> The scenario described above is very common: many business systems have several
> tables keyed by a major key such as userID, and they often store them in HBase.
> People regularly need to run analysis over these tables with SQL, and such
> queries would be well optimized if the SQL execution plan had good support for
> HBase rowkeys.






[jira] [Assigned] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15901:


Assignee: Apache Spark

> Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
> --
>
> Key: SPARK-15901
> URL: https://issues.apache.org/jira/browse/SPARK-15901
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> So far, we do not have test cases verifying whether the external parameters
> CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly when users set
> non-default values. Add test cases to avoid regressions.
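
A minimal sketch of such a test, assuming Spark's Hive test utilities (QueryTest,
SQLTestUtils, TestHiveSingleton) and the HiveUtils config keys:

{code}
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.hive.HiveUtils
import org.apache.spark.sql.hive.test.TestHiveSingleton
import org.apache.spark.sql.test.SQLTestUtils

class MetastoreConversionSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {

  test("CONVERT_METASTORE_PARQUET: results identical for both settings") {
    withTable("parquet_tab") {
      spark.sql("CREATE TABLE parquet_tab (id INT) STORED AS PARQUET")
      spark.sql("INSERT INTO parquet_tab VALUES (1), (2)")
      Seq("true", "false").foreach { flag =>
        withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> flag) {
          // Whether or not the Hive table is converted to the native Parquet
          // data source, the query result must not change.
          checkAnswer(spark.sql("SELECT id FROM parquet_tab ORDER BY id"), Seq(Row(1), Row(2)))
        }
      }
    }
  }
}
{code}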






[jira] [Commented] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET

2016-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326217#comment-15326217
 ] 

Apache Spark commented on SPARK-15901:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13622

> Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
> --
>
> Key: SPARK-15901
> URL: https://issues.apache.org/jira/browse/SPARK-15901
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, we do not have test cases verifying whether the external parameters
> CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly when users set
> non-default values. Add test cases to avoid regressions.






[jira] [Assigned] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15901:


Assignee: (was: Apache Spark)

> Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
> --
>
> Key: SPARK-15901
> URL: https://issues.apache.org/jira/browse/SPARK-15901
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> So far, we do not have test cases verifying whether the external parameters
> CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly when users set
> non-default values. Add test cases to avoid regressions.






[jira] [Created] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET

2016-06-11 Thread Xiao Li (JIRA)
Xiao Li created SPARK-15901:
---

 Summary: Test Cases for CONVERT_METASTORE_ORC and 
CONVERT_METASTORE_PARQUET
 Key: SPARK-15901
 URL: https://issues.apache.org/jira/browse/SPARK-15901
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


So far, we do not have test cases verifying whether the external parameters
CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly when users set
non-default values. Add test cases to avoid regressions.






[jira] [Resolved] (SPARK-15840) New csv reader does not "determine the input schema"

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15840.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.0.0

> New csv reader does not "determine the input schema"
> 
>
> Key: SPARK-15840
> URL: https://issues.apache.org/jira/browse/SPARK-15840
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0
>Reporter: Ernst Sjöstrand
>Assignee: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> When testing the new csv reader I found that it would not determine the input 
> schema as is stated in the documentation.
> (I used this documentation: 
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
>  )
> So either there is a bug in the implementation or in the documentation.
> This also means that options like dateFormat seem to be ignored.
> Here's a quick test in pyspark (using Python 3):
> a = spark.read.csv("/home/ernst/test.csv")
> a.printSchema()
> print(a.dtypes)
> a.show()
> {noformat}
>  root
>   |-- _c0: string (nullable = true)
>  [('_c0', 'string')]
>  +---+
>  |_c0|
>  +---+
>  |  1|
>  |  2|
>  |  3|
>  |  4|
>  +---+
> {noformat}
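
For comparison, schema inference does work when it is requested explicitly; a
minimal sketch using the Scala API against the same test file:

{code}
// By default the CSV source reads every column as string; type inference has to
// be asked for via the inferSchema option.
val spark = org.apache.spark.sql.SparkSession.builder().appName("csv-test").getOrCreate()

val inferred = spark.read
  .option("inferSchema", "true")   // opt in to type inference
  .option("header", "false")
  .csv("/home/ernst/test.csv")

inferred.printSchema()             // _c0 should now come back as int, not string
{code}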






[jira] [Resolved] (SPARK-15860) Metrics for codegen size and perf

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15860.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.0.0

> Metrics for codegen size and perf
> -
>
> Key: SPARK-15860
> URL: https://issues.apache.org/jira/browse/SPARK-15860
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> We should expose codahale metrics for the codegen source text size and how 
> long it takes to compile. The size is particularly interesting, since the JVM 
> does have hard limits on how large methods can get.
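
A minimal sketch of the kind of instrumentation meant here, assuming only the
Dropwizard (Codahale) metrics library; the metric names are illustrative:

{code}
import com.codahale.metrics.MetricRegistry

object CodegenMetricsSketch {
  val registry = new MetricRegistry
  // Distribution of generated source sizes: interesting because of JVM method limits.
  val sourceSize = registry.histogram(MetricRegistry.name("codegen", "generatedSourceSize"))
  // Time spent compiling the generated code.
  val compileTime = registry.timer(MetricRegistry.name("codegen", "compilationTime"))

  def instrumentedCompile(source: String)(compile: String => Unit): Unit = {
    sourceSize.update(source.length)
    val ctx = compileTime.time()
    try compile(source) finally ctx.stop()
  }
}
{code}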






[jira] [Commented] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode

2016-06-11 Thread niranda perera (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326203#comment-15326203
 ] 

niranda perera commented on SPARK-14736:


Hi guys, 

Any update on this? We are seeing this deadlock in our custom recovery mode 
impl quite often. 

Best

> Deadlock in registering applications while the Master is in the RECOVERING 
> mode
> ---
>
> Key: SPARK-14736
> URL: https://issues.apache.org/jira/browse/SPARK-14736
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1, 1.5.0, 1.6.0
> Environment: unix, Spark cluster with a custom 
> StandaloneRecoveryModeFactory and a custom PersistenceEngine
>Reporter: niranda perera
>Priority: Critical
>
> I have encountered the following issue in standalone recovery mode.
> Let's say there was an application A running in the cluster. Due to some issue,
> the entire cluster, together with application A, goes down.
> Later, the cluster comes back online and the master goes into 'recovering' mode,
> because the persistence engine tells it that apps, workers, and drivers were
> already present in the cluster. While the recovery is in progress, the
> application comes back online, but now with a different ID, let's say B.
> But then, per the master's application registration logic, application B will
> NOT be added to 'waitingApps'; it is rejected with the message "Attempted to
> re-register application at same address" [1]:
>   private def registerApplication(app: ApplicationInfo): Unit = {
>     val appAddress = app.driver.address
>     if (addressToApp.contains(appAddress)) {
>       logInfo("Attempted to re-register application at same address: " + appAddress)
>       return
>     }
> The problem here is that the master is trying to recover application A, which no
> longer exists, so after the recovery process app A is dropped. However, app A's
> successor, app B, was also omitted from the 'waitingApps' list because it had the
> same address as app A previously.
> This creates a deadlock in the cluster: neither app A nor app B is available.
> When the master is in the RECOVERING mode, shouldn't it first add all registering
> apps to a list, and then, after recovery completes (once the unsuccessful
> recoveries are removed), deploy the apps that are new?
> That would resolve this deadlock, IMO.
> [1] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
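
A simplified sketch of the buffering idea proposed above (the Master internals
are reduced to plain strings here, so this is illustrative only):

{code}
import scala.collection.mutable

object RecoveryState extends Enumeration { val ALIVE, RECOVERING = Value }

class MasterSketch {
  var state = RecoveryState.RECOVERING
  private val addressToApp = mutable.Map[String, String]()        // driver address -> app id
  private val deferred = mutable.ArrayBuffer[(String, String)]()  // (app id, driver address)

  def handleRegisterApplication(appId: String, driverAddress: String): Unit = {
    if (state == RecoveryState.RECOVERING) {
      // Defer: addressToApp still holds stale entries for apps being recovered,
      // so the duplicate-address check would wrongly reject the new registration.
      deferred += ((appId, driverAddress))
    } else if (!addressToApp.contains(driverAddress)) {
      addressToApp(driverAddress) = appId                         // normal registration path
    }
  }

  def completeRecovery(): Unit = {
    // Stale apps that never re-registered would be dropped here, then the
    // deferred registrations are admitted against the cleaned-up state.
    state = RecoveryState.ALIVE
    deferred.foreach { case (id, addr) => handleRegisterApplication(id, addr) }
    deferred.clear()
  }
}
{code}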






[jira] [Assigned] (SPARK-15898) DataFrameReader.text should return DataFrame

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15898:


Assignee: Apache Spark  (was: Wenchen Fan)

> DataFrameReader.text should return DataFrame
> 
>
> Key: SPARK-15898
> URL: https://issues.apache.org/jira/browse/SPARK-15898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See discussion at https://github.com/apache/spark/pull/13604
> We want to maintain API compatibility for DataFrameReader.text, and will 
> introduce a new API called DataFrameReader.textFile which returns 
> Dataset[String].
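
A short usage contrast under the proposed Spark 2.0 API (the path is made up):

{code}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().appName("demo").getOrCreate()

// Returns a DataFrame (Dataset[Row]) with a single "value" column, like the
// other DataFrameReader shortcuts.
val df: DataFrame = spark.read.text("/path/to/logs")

// The new typed variant returns Dataset[String].
val ds: Dataset[String] = spark.read.textFile("/path/to/logs")
{code}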






[jira] [Assigned] (SPARK-15898) DataFrameReader.text should return DataFrame

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15898:


Assignee: Wenchen Fan  (was: Apache Spark)

> DataFrameReader.text should return DataFrame
> 
>
> Key: SPARK-15898
> URL: https://issues.apache.org/jira/browse/SPARK-15898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>
> See discussion at https://github.com/apache/spark/pull/13604
> We want to maintain API compatibility for DataFrameReader.text, and will 
> introduce a new API called DataFrameReader.textFile which returns 
> Dataset[String].






[jira] [Commented] (SPARK-15898) DataFrameReader.text should return DataFrame

2016-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326193#comment-15326193
 ] 

Apache Spark commented on SPARK-15898:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13604

> DataFrameReader.text should return DataFrame
> 
>
> Key: SPARK-15898
> URL: https://issues.apache.org/jira/browse/SPARK-15898
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
>
> See discussion at https://github.com/apache/spark/pull/13604
> We want to maintain API compatibility for DataFrameReader.text, and will 
> introduce a new API called DataFrameReader.textFile which returns 
> Dataset[String].






[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStream for setting MqttConnectOptions

2016-06-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15900:
---
Summary: please add a map param on MQTTUtils.createStream for setting 
MqttConnectOptions   (was: please add a map param on MQTTUtils.createStreamfor 
setting MqttConnectOptions )

> please add a map param on MQTTUtils.createStream for setting 
> MqttConnectOptions 
> 
>
> Key: SPARK-15900
> URL: https://issues.apache.org/jira/browse/SPARK-15900
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I notice that MQTTReceiver creates its connection with the method
> (org.eclipse.paho.client.mqttv3.MqttClient) client.connect(),
> which is equivalent to client.connect(new MqttConnectOptions()).
> This forces us to use the default MqttConnectOptions, so we can't set
> other parameters such as username and password.
> Please add a new method to MQTTUtils.createStream, like
> createStream(jssc.ssc, brokerUrl, topic, map, storageLevel),
> in order to build a non-default MqttConnectOptions.
> Thanks.
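
A sketch of the *proposed* overload (this method does not exist in
spark-streaming-mqtt today; parameter names are illustrative):

{code}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream

object MQTTUtilsProposed {
  // The options map would carry MqttConnectOptions settings,
  // e.g. Map("username" -> "...", "password" -> "...", "cleanSession" -> "false").
  def createStream(
      ssc: StreamingContext,
      brokerUrl: String,
      topic: String,
      options: Map[String, String],
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2): ReceiverInputDStream[String] = {
    // A real implementation would build a non-default MqttConnectOptions from
    // `options` and call client.connect(opts) inside the receiver.
    throw new UnsupportedOperationException("proposal sketch only")
  }
}
{code}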






[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions

2016-06-11 Thread lichenglin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lichenglin updated SPARK-15900:
---
Summary: please add a map param on MQTTUtils.createStreamfor setting 
MqttConnectOptions   (was: please add a map param on MQTTUtils.create for 
setting MqttConnectOptions )

> please add a map param on MQTTUtils.createStreamfor setting 
> MqttConnectOptions 
> ---
>
> Key: SPARK-15900
> URL: https://issues.apache.org/jira/browse/SPARK-15900
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: lichenglin
>
> I notice that MQTTReceiver creates its connection with the method
> (org.eclipse.paho.client.mqttv3.MqttClient) client.connect(),
> which is equivalent to client.connect(new MqttConnectOptions()).
> This forces us to use the default MqttConnectOptions, so we can't set
> other parameters such as username and password.
> Please add a new method to MQTTUtils.createStream, like
> createStream(jssc.ssc, brokerUrl, topic, map, storageLevel),
> in order to build a non-default MqttConnectOptions.
> Thanks.






[jira] [Created] (SPARK-15900) please add a map param on MQTTUtils.create for setting MqttConnectOptions

2016-06-11 Thread lichenglin (JIRA)
lichenglin created SPARK-15900:
--

 Summary: please add a map param on MQTTUtils.create for setting 
MqttConnectOptions 
 Key: SPARK-15900
 URL: https://issues.apache.org/jira/browse/SPARK-15900
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.6.1
Reporter: lichenglin


I notice that MQTTReceiver creates its connection with the method
(org.eclipse.paho.client.mqttv3.MqttClient) client.connect(),
which is equivalent to client.connect(new MqttConnectOptions()).
This forces us to use the default MqttConnectOptions, so we can't set
other parameters such as username and password.

Please add a new method to MQTTUtils.createStream, like
createStream(jssc.ssc, brokerUrl, topic, map, storageLevel),
in order to build a non-default MqttConnectOptions.
Thanks.






[jira] [Commented] (SPARK-15857) Add Caller Context in Spark

2016-06-11 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326153#comment-15326153
 ] 

Sun Rui commented on SPARK-15857:
-

+1 for this feature

> Add Caller Context in Spark
> ---
>
> Key: SPARK-15857
> URL: https://issues.apache.org/jira/browse/SPARK-15857
> Project: Spark
>  Issue Type: New Feature
>Reporter: Weiqing Yang
>
> Hadoop has implemented a log-tracing feature called caller context (JIRAs
> HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand
> how specific applications impact parts of the Hadoop system and what potential
> problems they may be creating (e.g., overloading the NameNode). As mentioned in
> HDFS-9184, for a given HDFS operation it is very helpful to track which
> upper-level job issued it. The upper-level callers may be specific Oozie tasks,
> MR jobs, Hive queries, or Spark jobs.
> Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249,
> HIVE-12254), and Pig (PIG-4714) have implemented their own caller contexts.
> Those systems invoke the HDFS and YARN client APIs to set up the caller context,
> and also expose an API for passing a caller context in.
> Lots of Spark applications run on YARN/HDFS. Spark can likewise implement its
> caller context by invoking the HDFS/YARN API, and also expose an API so that its
> upstream applications can set up their own caller contexts. In the end, the
> Spark caller context written into the YARN and HDFS logs can be associated with
> the task id, stage id, job id, and app id. That is also very useful for Spark
> users to identify tasks, especially if Spark supports multi-tenant environments
> in the future.
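
A minimal sketch of what setting the context could look like, assuming Hadoop
2.8+'s org.apache.hadoop.ipc.CallerContext API from HDFS-9184 (the context
string layout is illustrative):

{code}
import org.apache.hadoop.ipc.CallerContext

// Tag subsequent HDFS/YARN RPCs issued by this thread with Spark identifiers,
// so they show up in the NameNode audit log and the YARN RM log.
def setSparkCallerContext(appId: String, jobId: Int, stageId: Int, taskId: Long): Unit = {
  val ctx = s"SPARK_AppId_${appId}_JobId_${jobId}_StageId_${stageId}_TaskId_$taskId"
  CallerContext.setCurrent(new CallerContext.Builder(ctx).build())
}
{code}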






[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-06-11 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326146#comment-15326146
 ] 

Sun Rui commented on SPARK-15799:
-

This has been requested before. The issue is that SparkR has to work together
with a Spark distribution of the matching version. I think releasing SparkR on
CRAN will promote its adoption, so we need to find a release model for it. My
thoughts are as follows:
1. Release the R portion of SparkR as a SparkR package on CRAN, following normal
R package conventions. The package records the matching Spark version and a link
to the Spark distribution, and has an .onLoad() function. When the package is
loaded, .onLoad() checks whether a local Spark distribution is installed; if not,
it attempts to download the distribution from the link and save it to a proper
location. The SparkR CRAN package depends on the Spark distribution for the
RBackend, for local-mode execution, and for remote cluster connections. .onLoad()
sets SPARK_HOME if it finds the Spark distribution.
2. Add a version-check mechanism, so SparkR can verify that it matches the remote
cluster when remote deployment is desired.
3. R users don't need special scripts like bin/sparkR or bin/spark-submit to use
SparkR. They can just start R and load SparkR with library(), or run a SparkR
script from the command line. SparkR.init() performs the version check and
displays an error message if there is a mismatch.


> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?






[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

2016-06-11 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326127#comment-15326127
 ] 

Weichen Xu commented on SPARK-15874:


OK... I got it.

But there is another problem: if I want to implement an HBase connector, I find
it can't be optimized well under the current Spark SQL architecture. For example,
the generated execution plan can't pass the proper rowkey down to the underlying
HBase table reader (which I would need to implement the small-table-driven join
optimization described above...). So could the Spark SQL execution planner add
support for such optimization techniques, which take advantage of an underlying
indexed table to speed up SQL execution?

> HBase rowkey optimization support for Hbase-Storage-handler
> ---
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause a full table scan (each table region becomes a scan split that does a
> full region scan).
> In fact, it is easy to implement the following optimizations:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use the HBase rowkey `Get`/multi-`Get` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit. For example, given a very small
> table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join: loop over each record of small_tab1,
> extract small_tab1.key1 as hbase_tab1's rowkey, and issue an HBase Get; the join
> then executes efficiently.
> The scenario described above is very common: many business systems have several
> tables keyed by a major key such as userID, and they often store them in HBase.
> People regularly need to run analysis over these tables with SQL, and such
> queries would be well optimized if the SQL execution plan had good support for
> HBase rowkeys.






[jira] [Assigned] (SPARK-2623) Stacked Auto Encoder (Deep Learning )

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2623:
---

Assignee: Apache Spark  (was: Victor Fang)

> Stacked Auto Encoder (Deep Learning )
> -
>
> Key: SPARK-2623
> URL: https://issues.apache.org/jira/browse/SPARK-2623
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Victor Fang
>Assignee: Apache Spark
>  Labels: deeplearning, machine_learning
>
> We would like to add a parallel implementation of the Stacked Auto Encoder (deep
> learning) algorithm to Spark MLlib.
> SAE is one of the most popular deep learning algorithms. It has achieved
> successful benchmarks in MNIST handwritten-digit classification, Google's
> ICML 2012 "cat face" paper (http://icml.cc/2012/papers/73.pdf), etc.
> Our focus is to leverage RDDs and deliver an SAE with the following capabilities,
> with ease of use for both beginners and advanced researchers:
> 1. multi-layer SAE deep network training and scoring;
> 2. unsupervised feature learning;
> 3. supervised learning with multinomial logistic regression (softmax).






[jira] [Commented] (SPARK-2623) Stacked Auto Encoder (Deep Learning )

2016-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326126#comment-15326126
 ] 

Apache Spark commented on SPARK-2623:
-

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/13621

> Stacked Auto Encoder (Deep Learning )
> -
>
> Key: SPARK-2623
> URL: https://issues.apache.org/jira/browse/SPARK-2623
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Victor Fang
>Assignee: Victor Fang
>  Labels: deeplearning, machine_learning
>
> We would like to add a parallel implementation of the Stacked Auto Encoder (deep
> learning) algorithm to Spark MLlib.
> SAE is one of the most popular deep learning algorithms. It has achieved
> successful benchmarks in MNIST handwritten-digit classification, Google's
> ICML 2012 "cat face" paper (http://icml.cc/2012/papers/73.pdf), etc.
> Our focus is to leverage RDDs and deliver an SAE with the following capabilities,
> with ease of use for both beginners and advanced researchers:
> 1. multi-layer SAE deep network training and scoring;
> 2. unsupervised feature learning;
> 3. supervised learning with multinomial logistic regression (softmax).






[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

2016-06-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326119#comment-15326119
 ] 

Reynold Xin commented on SPARK-15874:
-

Got it. There are already multiple HBase connectors for Spark SQL outside the
Spark project, and that's a good way to evolve the ecosystem. In practice a lot
of users are using various key-value stores, and we can't create built-in
connectors for all of them. Definitely feel free to contribute to an existing
one, or create another one that's better and put it on
https://spark-packages.org/.

> HBase rowkey optimization support for Hbase-Storage-handler
> ---
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause a full table scan (each table region becomes a scan split that does a
> full region scan).
> In fact, it is easy to implement the following optimizations:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use the HBase rowkey `Get`/multi-`Get` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit. For example, given a very small
> table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join: loop over each record of small_tab1,
> extract small_tab1.key1 as hbase_tab1's rowkey, and issue an HBase Get; the join
> then executes efficiently.
> The scenario described above is very common: many business systems have several
> tables keyed by a major key such as userID, and they often store them in HBase.
> People regularly need to run analysis over these tables with SQL, and such
> queries would be well optimized if the SQL execution plan had good support for
> HBase rowkeys.






[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

2016-06-11 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326114#comment-15326114
 ] 

Weichen Xu commented on SPARK-15874:


The HBase connector is implemented in Hive, and Spark SQL can use it, for example
to create an HBase external table in Spark SQL:
CREATE EXTERNAL TABLE hbase_tab1 (
  rowkey string,
  f1 map<string,string>,
  f2 map<string,string>,
  f3 map<string,string>
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")
TBLPROPERTIES ("hbase.table.name" = "tab1");

So... I think Spark SQL could provide a better-optimized HBase connector to
replace the one implemented in Hive.

> HBase rowkey optimization support for Hbase-Storage-handler
> ---
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause a full table scan (each table region becomes a scan split that does a
> full region scan).
> In fact, it is easy to implement the following optimizations:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use the HBase rowkey `Get`/multi-`Get` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit. For example, given a very small
> table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join: loop over each record of small_tab1,
> extract small_tab1.key1 as hbase_tab1's rowkey, and issue an HBase Get; the join
> then executes efficiently.
> The scenario described above is very common: many business systems have several
> tables keyed by a major key such as userID, and they often store them in HBase.
> People regularly need to run analysis over these tables with SQL, and such
> queries would be well optimized if the SQL execution plan had good support for
> HBase rowkeys.






[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2016-06-11 Thread Matthias Boehm (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326101#comment-15326101
 ] 

Matthias Boehm commented on SPARK-15882:


I really like this direction and think it has the potential to become a higher 
level API for Spark ML, as data frames and data sets have become for Spark SQL.

If there is interest, we'd like to help contribute to this feature by porting
over a subset of distributed linear algebra operations from SystemML.

General Goals: From my perspective, we should aim for an API that hides the 
underlying data representation (e.g., RDD/Dataset, sparse/dense, blocking 
configurations, block/row/coordinate, partitioning etc). Furthermore, it would 
be great to make it easy to swap out the used local matrix library. This 
approach would allow people to plug in their custom operations (e.g., native 
BLAS libraries/kernels or compressed block operations), while still relying on 
a common API and scheme for distributing blocks.

RDDs over Datasets: For the internal implementation, I would favor RDDs over 
Datasets because (1) RDDs allow for more flexibility (e.g., reduceByKey, 
combineByKey, partitioning-preserving operations), and (2) encoders don't offer 
much benefit for blocked representations as the per-block overhead is typically 
negligible. 

Basic Operations: Initially, I would start with a small well-defined set of 
operations including matrix multiplications, unary and binary operations (e.g., 
arithmetic/comparison), unary aggregates (e.g., sum/rowSums/colSums, 
min/max/mean/sd), reorg operations (transpose/diag/reshape/order), and 
cumulative aggregates (e.g., cumsum).

Towards Optimization: Internally, we could implement alternative operations but 
hide them under a common interface. For example, matrix multiplication would be 
exposed as 'multiply' (consistent with local linalg) - internally, however, we 
would select between alternative operations (see 
https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt),
 based on a simple rule set or user-provided hints as done in Spark SQL. Later, 
we could think about a more sophisticated optimizer, potentially relying on the 
existing catalyst infrastructure. What do you think? 
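
Purely as an illustration of the representation-hiding idea, a hypothetical
interface sketch (none of these names exist in Spark):

{code}
// Callers program against the trait; the block size, partitioning, sparse/dense
// layout, and the concrete multiply algorithm (mapmm, cpmm, rmm, ...) stay internal.
trait DistributedMatrix {
  def numRows: Long
  def numCols: Long
  def multiply(other: DistributedMatrix): DistributedMatrix
  def transpose: DistributedMatrix
  def colSums(): DistributedMatrix
  def agg(op: String): Double        // e.g. "sum", "min", "max", "mean"
}
{code}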

> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?






[jira] [Created] (SPARK-15899) file scheme should be used correctly

2016-06-11 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-15899:


 Summary: file scheme should be used correctly
 Key: SPARK-15899
 URL: https://issues.apache.org/jira/browse/SPARK-15899
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kazuaki Ishizaki


[An RFC|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as 
{{file://host/}} or {{file:///}} 
(see also [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme]).
[Some code 
paths|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58]
 use a different prefix such as {{file:}}.

It would be good to provide a utility method that correctly adds the {{file://host/}} 
or {{file:///}} prefix.
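
A minimal sketch of such a utility, using only the JDK (behavior for exotic
inputs left aside):

{code}
import java.nio.file.Paths

object FileSchemeUtil {
  // Return the path unchanged if it already carries a scheme; otherwise resolve
  // it to an absolute path and render it as a well-formed file:/// URI.
  def toFileUri(path: String): String =
    if (path.matches("^[A-Za-z][A-Za-z0-9+.\\-]*:.*")) path
    else Paths.get(path).toAbsolutePath.toUri.toString
}

// e.g. FileSchemeUtil.toFileUri("/tmp/warehouse") yields a "file:///..." form URI.
{code}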







[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler

2016-06-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326069#comment-15326069
 ] 

Reynold Xin commented on SPARK-15874:
-

I'm confused -- Apache Spark's code base itself does not include an HBase
connector. Which one are you referring to?

> HBase rowkey optimization support for Hbase-Storage-handler
> ---
>
> Key: SPARK-15874
> URL: https://issues.apache.org/jira/browse/SPARK-15874
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Weichen Xu
>   Original Estimate: 720h
>  Remaining Estimate: 720h
>
> Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler`
> for HBase table support, which is poorly optimized. For example, a query such as
> select * from hbase_tab1 where rowkey_col = 'abc';
> will cause a full table scan (each table region becomes a scan split that does a
> full region scan).
> In fact, it is easy to implement the following optimizations:
> 1.
> SQL such as
> `select * from hbase_tab1 where rowkey_col = 'abc';`
> or
> `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or ...;`
> can use the HBase rowkey `Get`/multi-`Get` API to execute efficiently.
> 2.
> SQL such as
> `select * from hbase_tab1 where rowkey_col like 'abc%';`
> can use the HBase rowkey `Scan` API to execute efficiently.
> Higher-level SQL optimizations also benefit. For example, given a very small
> table (such as incremental data) `small_tab1`, SQL such as
> `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = hbase_tab1.rowkey_col`
> can use the classic small-table-driven join: loop over each record of small_tab1,
> extract small_tab1.key1 as hbase_tab1's rowkey, and issue an HBase Get; the join
> then executes efficiently.
> The scenario described above is very common: many business systems have several
> tables keyed by a major key such as userID, and they often store them in HBase.
> People regularly need to run analysis over these tables with SQL, and such
> queries would be well optimized if the SQL execution plan had good support for
> HBase rowkeys.






[jira] [Resolved] (SPARK-15807) Support varargs for dropDuplicates in Dataset/DataFrame

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15807.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Support varargs for dropDuplicates in Dataset/DataFrame
> ---
>
> Key: SPARK-15807
> URL: https://issues.apache.org/jira/browse/SPARK-15807
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue adds `varargs`-types `dropDuplicates` functions in 
> `Dataset/DataFrame`. Currently, `dropDuplicates` supports only `Seq` or 
> `Array`.
> {code}
> scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
> ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
> scala> ds.dropDuplicates(Seq("_1", "_2"))
> res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, 
> _2: int]
> scala> ds.dropDuplicates("_1", "_2")
> :26: error: overloaded method value dropDuplicates with alternatives:
>   (colNames: 
> Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 
>   (colNames: 
> Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] 
>   ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
>  cannot be applied to (String, String)
>ds.dropDuplicates("_1", "_2")
>   ^
> {code}






[jira] [Resolved] (SPARK-14851) Support radix sort with nullable longs

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14851.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.1.0

> Support radix sort with nullable longs
> --
>
> Key: SPARK-14851
> URL: https://issues.apache.org/jira/browse/SPARK-14851
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.1.0
>
>
> The current radix sort cannot handle nullable longs, since there is no bit 
> pattern available to represent nulls. These cases are probably best handled 
> outside the radix sort logic, e.g. by keeping nulls in a separate region of 
> the array.
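
A toy sketch of the "separate region" idea (boxed longs and a library sort stand
in for the real off-heap radix sort):

{code}
object NullableLongSort {
  // Partition out the nulls, sort only the non-null values, then put the nulls
  // back as a contiguous region at the front or the back.
  def sort(values: Array[java.lang.Long], nullsFirst: Boolean = true): Array[java.lang.Long] = {
    val (nulls, nonNulls) = values.partition(_ == null)
    val sorted = nonNulls.sortBy(_.longValue())        // stand-in for the radix sort
    if (nullsFirst) nulls ++ sorted else sorted ++ nulls
  }
}
{code}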






[jira] [Created] (SPARK-15898) DataFrameReader.text should return DataFrame

2016-06-11 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15898:
---

 Summary: DataFrameReader.text should return DataFrame
 Key: SPARK-15898
 URL: https://issues.apache.org/jira/browse/SPARK-15898
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan


See discussion at https://github.com/apache/spark/pull/13604

We want to maintain API compatibility for DataFrameReader.text, and will 
introduce a new API called DataFrameReader.textFile which returns 
Dataset[String].







[jira] [Commented] (SPARK-15856) Revert API breaking changes made in SQLContext.range

2016-06-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326065#comment-15326065
 ] 

Reynold Xin commented on SPARK-15856:
-

Note that we have decided to only revert the SQLContext.range API in this 
ticket.

> Revert API breaking changes made in SQLContext.range
> 
>
> Key: SPARK-15856
> URL: https://issues.apache.org/jira/browse/SPARK-15856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> In Spark 2.0, after unifying Datasets and DataFrames, we made two API 
> breaking changes:
> # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
> {{DataFrame}}
> # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
> {{DataFrame}}
> However, these two changes introduced several inconsistencies and problems:
> # {{spark.read.text()}} silently discards partitioned columns when reading a 
> partitioned table in text format since {{Dataset\[String\]}} only contains a 
> single field. Users have to use {{spark.read.format("text").load()}} to 
> work around this, which is pretty confusing and error-prone.
> # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} 
> (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}.
> # When applying typed operations over Datasets returned by {{spark.range()}}, 
> weird schema changes may happen. Please refer to SPARK-15632 for more details.
> Due to these reasons, we decided to revert these two changes.






[jira] [Updated] (SPARK-15856) Revert API breaking changes made in SQLContext.range

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15856:

Summary: Revert API breaking changes made in SQLContext.range  (was: Revert 
API breaking changes made in DataFrameReader.text and SQLContext.range)

> Revert API breaking changes made in SQLContext.range
> 
>
> Key: SPARK-15856
> URL: https://issues.apache.org/jira/browse/SPARK-15856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> In Spark 2.0, after unifying Datasets and DataFrames, we made two API 
> breaking changes:
> # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
> {{DataFrame}}
> # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
> {{DataFrame}}
> However, these two changes introduced several inconsistencies and problems:
> # {{spark.read.text()}} silently discards partitioned columns when reading a 
> partitioned table in text format since {{Dataset\[String\]}} only contains a 
> single field. Users have to use {{spark.read.format("text").load()}} to 
> work around this, which is pretty confusing and error-prone.
> # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} 
> (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}.
> # When applying typed operations over Datasets returned by {{spark.range()}}, 
> weird schema changes may happen. Please refer to SPARK-15632 for more details.
> Due to these reasons, we decided to revert these two changes.






[jira] [Resolved] (SPARK-15856) Revert API breaking changes made in SQLContext.range

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15856.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Revert API breaking changes made in SQLContext.range
> 
>
> Key: SPARK-15856
> URL: https://issues.apache.org/jira/browse/SPARK-15856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> In Spark 2.0, after unifying Datasets and DataFrames, we made two API 
> breaking changes:
> # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of 
> {{DataFrame}}
> # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of 
> {{DataFrame}}
> However, these two changes introduced several inconsistencies and problems:
> # {{spark.read.text()}} silently discards partitioned columns when reading a 
> partitioned table in text format since {{Dataset\[String\]}} only contains a 
> single field. Users have to use {{spark.read.format("text").load()}} to 
> work around this, which is pretty confusing and error-prone.
> # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} 
> (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}.
> # When applying typed operations over Datasets returned by {{spark.range()}}, 
> weird schema changes may happen. Please refer to SPARK-15632 for more details.
> Due to these reasons, we decided to revert these two changes.






[jira] [Resolved] (SPARK-15881) Update microbenchmark results

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15881.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.0.0

> Update microbenchmark results
> -
>
> Key: SPARK-15881
> URL: https://issues.apache.org/jira/browse/SPARK-15881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-15585) Don't use null in data source options to indicate default value

2016-06-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15585.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.0.0

> Don't use null in data source options to indicate default value
> ---
>
> Key: SPARK-15585
> URL: https://issues.apache.org/jira/browse/SPARK-15585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Takeshi Yamamuro
>Priority: Critical
> Fix For: 2.0.0
>
>
> See email: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html
> We'd need to change DataFrameReader/DataFrameWriter in Python's 
> csv/json/parquet/... functions to put the actual default option values as 
> function parameters, rather than setting them to None. We can then make 
> CSVOptions.getChar (and JSONOptions, etc.) actually return null if the 
> value is null, rather than setting it to the default value.






[jira] [Updated] (SPARK-15881) Update microbenchmark results

2016-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15881:
--
Fix Version/s: (was: 2.0.0)

> Update microbenchmark results
> -
>
> Key: SPARK-15881
> URL: https://issues.apache.org/jira/browse/SPARK-15881
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>







[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15639:
--
Fix Version/s: (was: 2.0.0)

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>
> When we use the vectorized Parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) retrieves pushed-down filters for 
> row-group-level filtering, we don't seem to actually set up the filters to be 
> pushed down.
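
For context, a minimal sketch of what "setting up" the row-group filter means at
the Parquet API level (the column name and threshold are made up):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

val hadoopConf = new Configuration()
// Register the predicate so the reader can drop whole row groups whose
// statistics (min/max) show that no row can match.
val predicate = FilterApi.gt(FilterApi.intColumn("id"), java.lang.Integer.valueOf(100))
ParquetInputFormat.setFilterPredicate(hadoopConf, predicate)
{code}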






[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark

2016-06-11 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326036#comment-15326036
 ] 

Josh Rosen commented on SPARK-12661:


Yeah, I think that just messaging that Python 2.6 users should aim to upgrade 
to 2.7+ before Spark 2.1.0 will be sufficient (and maybe print a deprecation 
warning if we detect that we're running on 2.6).

> Drop Python 2.6 support in PySpark
> --
>
> Key: SPARK-12661
> URL: https://issues.apache.org/jira/browse/SPARK-12661
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Reporter: Davies Liu
>  Labels: releasenotes
>
> 1. stop testing with 2.6
> 2. remove the code for python 2.6
> see discussion : 
> https://www.mail-archive.com/user@spark.apache.org/msg43423.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15892:
--
Assignee: Hyukjin Kwon

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}
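
The error above is the symptom described in the summary: an AFTAggregator merged in from an empty partition ends up contributing a zero total count. A minimal, self-contained sketch of the kind of merge guard that avoids this (illustrative names, not the actual AFTAggregator code):

{code}
// An aggregator built on an empty partition carries no information; the
// guard below simply skips it during merge so it never contributes to the
// combined state.
class CountingAggregator(var count: Long = 0L, var lossSum: Double = 0.0) {
  def add(loss: Double): this.type = { count += 1L; lossSum += loss; this }

  def merge(other: CountingAggregator): this.type = {
    if (other.count != 0L) {   // skip aggregators that saw no instances
      count += other.count
      lossSum += other.lossSum
    }
    this
  }
}
{code}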



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15892:
--
Summary: Incorrectly merged AFTAggregator with zero total count  (was: 
aft_survival_regression.py example fails in branch-2.0)

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15892:
--
Shepherd: Joseph K. Bradley

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15892:
--
Affects Version/s: 1.6.1

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15590) Paginate Job Table in Jobs tab

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15590:


Assignee: Apache Spark  (was: Tao Lin)

> Paginate Job Table in Jobs tab
> --
>
> Key: SPARK-15590
> URL: https://issues.apache.org/jira/browse/SPARK-15590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15590) Paginate Job Table in Jobs tab

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15590:


Assignee: Tao Lin  (was: Apache Spark)

> Paginate Job Table in Jobs tab
> --
>
> Key: SPARK-15590
> URL: https://issues.apache.org/jira/browse/SPARK-15590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15590) Paginate Job Table in Jobs tab

2016-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325971#comment-15325971
 ] 

Apache Spark commented on SPARK-15590:
--

User 'nblintao' has created a pull request for this issue:
https://github.com/apache/spark/pull/13620

> Paginate Job Table in Jobs tab
> --
>
> Key: SPARK-15590
> URL: https://issues.apache.org/jira/browse/SPARK-15590
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Yin Huai
>Assignee: Tao Lin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15892:


Assignee: Apache Spark

> aft_survival_regression.py example fails in branch-2.0
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0

2016-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325955#comment-15325955
 ] 

Apache Spark commented on SPARK-15892:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/13619

> aft_survival_regression.py example fails in branch-2.0
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15892:


Assignee: (was: Apache Spark)

> aft_survival_regression.py example fails in branch-2.0
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15883) Fix broken links on MLLIB documentations

2016-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15883.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13608
[https://github.com/apache/spark/pull/13608]

> Fix broken links on MLLIB documentations
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.0.0
>
>
> This issue fixes all broken links in the Spark 2.0 preview MLlib documents. It 
> also contains some editorial changes.
> **Fix broken links**
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> **Fix malformed section header and scala coding style**
>   * mllib-linear-methods.md
> **Replace indirect forward links with direct ones**
>   * ml-classification-regression.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15883) Fix broken links on MLLIB documentations

2016-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15883:
--
Assignee: Dongjoon Hyun

> Fix broken links on MLLIB documentations
> 
>
> Key: SPARK-15883
> URL: https://issues.apache.org/jira/browse/SPARK-15883
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Trivial
> Fix For: 2.0.0
>
>
> This issue fixes all broken links in the Spark 2.0 preview MLlib documents. It 
> also contains some editorial changes.
> **Fix broken links**
>   * mllib-data-types.md
>   * mllib-decision-tree.md
>   * mllib-ensembles.md
>   * mllib-feature-extraction.md
>   * mllib-pmml-model-export.md
>   * mllib-statistics.md
> **Fix malformed section header and scala coding style**
>   * mllib-linear-methods.md
> **Replace indirect forward links with direct ones**
>   * ml-classification-regression.md



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15796:


Assignee: Apache Spark

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Assignee: Apache Spark
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("Cache Demo Application")
>     val sc = new SparkContext(conf)
>     val startTime = System.currentTimeMillis()
>     val cacheFiller = sc.parallelize(1 to 5, 1000)
>       .mapPartitionsWithIndex {
>         case (ix, it) =>
>           println(s"CREATE DATA PARTITION ${ix}")
>           val r = new scala.util.Random(ix)
>           it.map(x => (r.nextLong, r.nextLong))
>       }
>     cacheFiller.persist(StorageLevel.MEMORY_ONLY)
>     cacheFiller.foreach(identity)
>     val finishTime = System.currentTimeMillis()
>     val elapsedTime = (finishTime - startTime) / 1000
>     println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.
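
For reference, the first workaround above can also be applied programmatically instead of on the spark-submit command line (a sketch using the values from the report, not a general recommendation):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Cap Spark's unified memory region at 60% of the heap so that long-lived
// cached blocks fit in the JVM old generation under the default NewRatio=2.
val conf = new SparkConf()
  .setAppName("Cache Demo Application")
  .setMaster("local[2]")
  .set("spark.memory.fraction", "0.6")
val sc = new SparkContext(conf)
{code}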



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325833#comment-15325833
 ] 

Apache Spark commented on SPARK-15796:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13618

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("Cache Demo Application")
>     val sc = new SparkContext(conf)
>     val startTime = System.currentTimeMillis()
>     val cacheFiller = sc.parallelize(1 to 5, 1000)
>       .mapPartitionsWithIndex {
>         case (ix, it) =>
>           println(s"CREATE DATA PARTITION ${ix}")
>           val r = new scala.util.Random(ix)
>           it.map(x => (r.nextLong, r.nextLong))
>       }
>     cacheFiller.persist(StorageLevel.MEMORY_ONLY)
>     cacheFiller.foreach(identity)
>     val finishTime = System.currentTimeMillis()
>     val elapsedTime = (finishTime - startTime) / 1000
>     println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config

2016-06-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15796:


Assignee: (was: Apache Spark)

> Reduce spark.memory.fraction default to avoid overrunning old gen in JVM 
> default config
> ---
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().setAppName("Cache Demo Application")
>     val sc = new SparkContext(conf)
>     val startTime = System.currentTimeMillis()
>     val cacheFiller = sc.parallelize(1 to 5, 1000)
>       .mapPartitionsWithIndex {
>         case (ix, it) =>
>           println(s"CREATE DATA PARTITION ${ix}")
>           val r = new scala.util.Random(ix)
>           it.map(x => (r.nextLong, r.nextLong))
>       }
>     cacheFiller.persist(StorageLevel.MEMORY_ONLY)
>     cacheFiller.foreach(identity)
>     val finishTime = System.currentTimeMillis()
>     val elapsedTime = (finishTime - startTime) / 1000
>     println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's cache. The real upper limit is 
> spark.memory.fraction.
> To sum up my questions/issues:
> * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. 
> Maybe the old generation size should also be mentioned in configuration.html 
> near spark.memory.fraction.
> * Is it a goal for Spark to support heavy caching with default parameters and 
> without GC breakdown? If so, then better default values are needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-15879:
--
Assignee: Sean Owen
Priority: Minor  (was: Major)

> Update logo in UI and docs to add "Apache"
> --
>
> Key: SPARK-15879
> URL: https://issues.apache.org/jira/browse/SPARK-15879
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Web UI
>Reporter: Matei Zaharia
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0
>
>
> We recently added "Apache" to the Spark logo on the website 
> (http://spark.apache.org/images/spark-logo.eps) to have it be the full 
> project name, and we should do the same in the web UI and docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15879) Update logo in UI and docs to add "Apache"

2016-06-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15879.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13609
[https://github.com/apache/spark/pull/13609]

> Update logo in UI and docs to add "Apache"
> --
>
> Key: SPARK-15879
> URL: https://issues.apache.org/jira/browse/SPARK-15879
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Web UI
>Reporter: Matei Zaharia
> Fix For: 2.0.0
>
>
> We recently added "Apache" to the Spark logo on the website 
> (http://spark.apache.org/images/spark-logo.eps) to have it be the full 
> project name, and we should do the same in the web UI and docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15897) Function Registry should just take in FunctionIdentifier for type safety and avoid duplicating

2016-06-11 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325826#comment-15325826
 ] 

Sandeep Singh commented on SPARK-15897:
---

I'm working on this, will create a PR soon.

> Function Registry should just take in FunctionIdentifier for type safety and 
> avoid duplicating
> --
>
> Key: SPARK-15897
> URL: https://issues.apache.org/jira/browse/SPARK-15897
> Project: Spark
>  Issue Type: Improvement
>Reporter: Sandeep Singh
>Priority: Minor
>  Labels: sql
>
> Function Registry should just take in FunctionIdentifier for type safety and 
> avoid duplicating
> TODOs were added in these PRs
> https://github.com/apache/spark/pull/12713
> (https://github.com/apache/spark/pull/12713/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R619)
> https://github.com/apache/spark/pull/12198  
> (https://github.com/apache/spark/pull/12198/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R457)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15897) Function Registry should just take in FunctionIdentifier for type safety and avoid duplicating

2016-06-11 Thread Sandeep Singh (JIRA)
Sandeep Singh created SPARK-15897:
-

 Summary: Function Registry should just take in FunctionIdentifier 
for type safety and avoid duplicating
 Key: SPARK-15897
 URL: https://issues.apache.org/jira/browse/SPARK-15897
 Project: Spark
  Issue Type: Improvement
Reporter: Sandeep Singh
Priority: Minor


Function Registry should just take in FunctionIdentifier for type safety and 
avoid duplicating

TODOs were added in these PRs
https://github.com/apache/spark/pull/12713
(https://github.com/apache/spark/pull/12713/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R619)

https://github.com/apache/spark/pull/12198  
(https://github.com/apache/spark/pull/12198/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R457)
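
As a rough sketch of the type-safety argument (an illustrative API, not Spark's actual FunctionRegistry), keying the registry on a FunctionIdentifier rather than a raw String makes the database-qualified name explicit and keeps call sites from passing bare strings:

{code}
import scala.collection.mutable

// Simplified stand-ins for the real catalyst types, for illustration only.
case class FunctionIdentifier(funcName: String, database: Option[String] = None)

class SimpleFunctionRegistry {
  private val functions =
    mutable.HashMap.empty[FunctionIdentifier, Seq[Any] => Any]

  // Callers must construct a FunctionIdentifier, so database-qualified names
  // are handled in one place instead of being re-parsed from strings.
  def registerFunction(name: FunctionIdentifier, builder: Seq[Any] => Any): Unit =
    functions.put(name, builder)

  def lookupFunction(name: FunctionIdentifier): Option[Seq[Any] => Any] =
    functions.get(name)
}
{code}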




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-11 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325761#comment-15325761
 ] 

Pete Robbins commented on SPARK-15822:
--

This has failed on the latest branch-2.0 and master. Currently using branch-2.0 at:

commit a790ac5793e1988895341fa878f947b09b275926
Author: yinxusen 
Date:   Wed Jun 8 09:18:04 2016 +0100



> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}
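
For context, the off-heap settings quoted above correspond to a configuration along these lines (a sketch of the reporter's setup, not a recommended configuration; the description notes the crash also reproduces with off-heap disabled):

{code}
import org.apache.spark.SparkConf

// The memory settings under which the segmentation violation was first seen.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "512m")
{code}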



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String

2016-06-11 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325759#comment-15325759
 ] 

Pete Robbins commented on SPARK-15822:
--

The stack trace is taken earlier in execution, at the point where I detect that a 
UTF8String has been created from the corrupt UnsafeRow, as I'm trying to backtrack 
to the point of corruption. The stack trace posted earlier is the NPE, which occurs 
later in execution, when the corrupt UTF8String is actually used.

Dumb question but how do I post the plan?
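
For reference, one way to capture the plans for a DataFrame in Spark 2.0 (a generic sketch, not tied to this reproduction):

{code}
import org.apache.spark.sql.SparkSession

object ShowPlan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ShowPlan").master("local[2]").getOrCreate()
    import spark.implicits._

    val df = Seq(("abc", 1), ("abd", 2)).toDF("key", "value").filter($"key" === "abc")

    // Prints the parsed, analyzed, optimized and physical plans to stdout.
    df.explain(true)

    // Or grab the same text as a String, e.g. to paste into a JIRA comment.
    val planText = df.queryExecution.toString
    println(planText)

    spark.stop()
  }
}
{code}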

> segmentation violation in o.a.s.unsafe.types.UTF8String 
> 
>
> Key: SPARK-15822
> URL: https://issues.apache.org/jira/browse/SPARK-15822
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: linux amd64
> openjdk version "1.8.0_91"
> OpenJDK Runtime Environment (build 1.8.0_91-b14)
> OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
>Reporter: Pete Robbins
>Assignee: Herman van Hovell
>Priority: Blocker
>
> Executors fail with segmentation violation while running application with
> spark.memory.offHeap.enabled true
> spark.memory.offHeap.size 512m
> Also now reproduced with 
> spark.memory.offHeap.enabled false
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400
> #
> # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14)
> # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # J 4816 C2 
> org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I
>  (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d]
> {noformat}
> We initially saw this with IBM Java on a PowerPC box, but it is recreatable on 
> Linux with OpenJDK. On Linux with IBM Java 8 we see a null pointer exception at 
> the same code point:
> {noformat}
> 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48)
> java.lang.NullPointerException
>   at 
> org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831)
>   at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30)
>   at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664)
>   at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.lang.Thread.run(Thread.java:785)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org