[jira] [Closed] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-15874. --- Resolution: Not A Problem > HBase rowkey optimization support for Hbase-Storage-handler > --- > > Key: SPARK-15874 > URL: https://issues.apache.org/jira/browse/SPARK-15874 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Weichen Xu > Original Estimate: 720h > Remaining Estimate: 720h > > Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` > for HBase table support, which is poorly optimized. For example, a query such > as > select * from hbase_tab1 where rowkey_col = 'abc'; > will cause a full table scan (each table region becomes a scan split and performs a > full region scan). > In fact, it is easy to implement the following optimizations: > 1. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc';` > or > `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or > ...;` > can use the HBase rowkey `Get`/`multiGet` API to execute efficiently. > 2. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc%';` > can use the HBase rowkey `Scan` API to execute efficiently. > Higher-level SQL optimizations will also benefit. For > example, given a very small table (such as incremental data) `small_tab1`, > SQL such as > `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = > hbase_tab1.rowkey_col` > can use the classic small-table-driven join optimization: > loop over each record of small_tab1, extract each small_tab1.key1 as > hbase_tab1's rowkey, and use the HBase Get API; the join will then execute efficiently. > This scenario is very common: many business systems have > several tables keyed by a major key such as userID, and they often store them > in HBase. People frequently need to do some analysis with > SQL, and these queries would be well optimized if the SQL execution plan had > good support for HBase rowkeys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326223#comment-15326223 ] Reynold Xin commented on SPARK-15874: - If you want to get fancy you can play with the experimental strategies in SQLContext. Take a look at that. I'm going to mark this ticket as won't fix for now. Thanks! > HBase rowkey optimization support for Hbase-Storage-handler > --- > > Key: SPARK-15874 > URL: https://issues.apache.org/jira/browse/SPARK-15874 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Weichen Xu > Original Estimate: 720h > Remaining Estimate: 720h > > Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` > for HBase table support, which is poorly optimized. For example, a query such > as > select * from hbase_tab1 where rowkey_col = 'abc'; > will cause a full table scan (each table region becomes a scan split and performs a > full region scan). > In fact, it is easy to implement the following optimizations: > 1. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc';` > or > `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or > ...;` > can use the HBase rowkey `Get`/`multiGet` API to execute efficiently. > 2. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc%';` > can use the HBase rowkey `Scan` API to execute efficiently. > Higher-level SQL optimizations will also benefit. For > example, given a very small table (such as incremental data) `small_tab1`, > SQL such as > `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = > hbase_tab1.rowkey_col` > can use the classic small-table-driven join optimization: > loop over each record of small_tab1, extract each small_tab1.key1 as > hbase_tab1's rowkey, and use the HBase Get API; the join will then execute efficiently. > This scenario is very common: many business systems have > several tables keyed by a major key such as userID, and they often store them > in HBase. People frequently need to do some analysis with > SQL, and these queries would be well optimized if the SQL execution plan had > good support for HBase rowkeys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
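[Editorial aside] As a rough sketch of the "experimental strategies" hook mentioned in the comment above (not part of the ticket itself): a custom planning strategy can be registered on SQLContext.experimental. The rowkey pattern match below is purely illustrative, and returning Nil simply falls back to Spark's built-in strategies.
{code}
// Minimal sketch, assuming the Spark 1.6/2.0 Strategy and extraStrategies API.
import org.apache.spark.sql.{SQLContext, Strategy}
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Literal}
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.execution.SparkPlan

object RowkeyGetStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // A point lookup on a column named "rowkey_col" could be planned as an HBase Get here.
    case Filter(EqualTo(attr: AttributeReference, _: Literal), _) if attr.name == "rowkey_col" =>
      Nil // placeholder: a real connector would emit a Get-based physical plan instead
    case _ => Nil // fall through to the default strategies
  }
}

// Registration (sqlContext is an existing SQLContext, e.g. in the shell):
// sqlContext.experimental.extraStrategies = RowkeyGetStrategy :: Nil
{code}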
[jira] [Assigned] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
[ https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15901: Assignee: Apache Spark > Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET > -- > > Key: SPARK-15901 > URL: https://issues.apache.org/jira/browse/SPARK-15901 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > So far, we do not have test cases for verifying whether the external > parameters CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET properly work > when users set non-default values. Adding test cases to avoid regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
[ https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326217#comment-15326217 ] Apache Spark commented on SPARK-15901: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13622 > Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET > -- > > Key: SPARK-15901 > URL: https://issues.apache.org/jira/browse/SPARK-15901 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > So far, we do not have test cases for verifying whether the external > parameters CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET properly work > when users set non-default values. Adding test cases to avoid regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
[ https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15901: Assignee: (was: Apache Spark) > Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET > -- > > Key: SPARK-15901 > URL: https://issues.apache.org/jira/browse/SPARK-15901 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > So far, we do not have test cases for verifying whether the external > parameters CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET properly work > when users set non-default values. Adding test cases to avoid regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
Xiao Li created SPARK-15901: --- Summary: Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET Key: SPARK-15901 URL: https://issues.apache.org/jira/browse/SPARK-15901 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li So far, we do not have test cases for verifying whether the external parameters CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET properly work when users set non-default values. Adding test cases to avoid regressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15840) New csv reader does not "determine the input schema"
[ https://issues.apache.org/jira/browse/SPARK-15840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15840. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.0.0 > New csv reader does not "determine the input schema" > > > Key: SPARK-15840 > URL: https://issues.apache.org/jira/browse/SPARK-15840 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0 >Reporter: Ernst Sjöstrand >Assignee: Hyukjin Kwon > Fix For: 2.0.0 > > > When testing the new csv reader I found that it would not determine the input > schema as is stated in the documentation. > (I used this documentation: > https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext > ) > So either there is a bug in the implementation or in the documentation. > This also means that options like dateFormat seem to be ignored. > Here's a quick test in pyspark (using Python3): > a = spark.read.csv("/home/ernst/test.csv") > a.printSchema() > print(a.dtypes) > a.show() > {noformat} > root > |-- _c0: string (nullable = true) > [('_c0', 'string')] > +---+ > |_c0| > +---+ > | 1| > | 2| > | 3| > | 4| > +---+ > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
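[Editorial aside] For reference, in the 2.0 reader schema inference is opt-in via the inferSchema option; without it every CSV column is read as string, which matches the behaviour reported above. A minimal Scala equivalent of the PySpark snippet (path taken from the report, spark being the shell session):
{code}
val df = spark.read
  .option("inferSchema", "true")   // ask the reader to sample the file and infer column types
  .csv("/home/ernst/test.csv")
df.printSchema()                   // _c0 should now come back as an integer column, not string
{code}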
[jira] [Resolved] (SPARK-15860) Metrics for codegen size and perf
[ https://issues.apache.org/jira/browse/SPARK-15860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15860. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.0.0 > Metrics for codegen size and perf > - > > Key: SPARK-15860 > URL: https://issues.apache.org/jira/browse/SPARK-15860 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.0.0 > > > We should expose codahale metrics for the codegen source text size and how > long it takes to compile. The size is particularly interesting, since the JVM > does have hard limits on how large methods can get. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14736) Deadlock in registering applications while the Master is in the RECOVERING mode
[ https://issues.apache.org/jira/browse/SPARK-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326203#comment-15326203 ] niranda perera commented on SPARK-14736: Hi guys, Any update on this? We are seeing this deadlock in our custom recovery mode impl quite often. Best > Deadlock in registering applications while the Master is in the RECOVERING > mode > --- > > Key: SPARK-14736 > URL: https://issues.apache.org/jira/browse/SPARK-14736 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1, 1.5.0, 1.6.0 > Environment: unix, Spark cluster with a custom > StandaloneRecoveryModeFactory and a custom PersistenceEngine >Reporter: niranda perera >Priority: Critical > > I have encountered the following issue in the standalone recovery mode. > Let's say there was an application A running in the cluster. Due to some > issue, the entire cluster, together with application A, goes down. > Later on, the cluster comes back online, and the master goes into the > 'recovering' mode, because it sees that some apps, workers and drivers were > already in the cluster according to the persistence engine. While the recovery > process is running, the application comes back online, but now it has a different > ID, let's say B. > But then, as per the master's application registration logic, this application > B will NOT be added to 'waitingApps', with the message "Attempted to > re-register application at same address". [1] > private def registerApplication(app: ApplicationInfo): Unit = { > val appAddress = app.driver.address > if (addressToApp.contains(appAddress)) { > logInfo("Attempted to re-register application at same address: " + > appAddress) > return > } > The problem here is that the master is trying to recover application A, which is no > longer there. Therefore, after the recovery process, app A will be > dropped. However, app A's successor, app B, was also omitted from the > 'waitingApps' list because it had the same address as app A previously. > This creates a deadlock in the cluster: neither app A nor app B is available in the > cluster. > When the master is in the RECOVERING mode, shouldn't it add all the > registering apps to a list first, and then, after the recovery is completed > (once the unsuccessful recoveries are removed), deploy the apps which are new? > This would resolve the deadlock, IMO. > [1] > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
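[Editorial aside] A simplified, self-contained model of the buffering idea proposed in the description (this is not Master's actual code; all names here are hypothetical): while RECOVERING, stash incoming registrations and only apply the duplicate-address check after recovery has completed and stale apps have been dropped.
{code}
object RecoveryState extends Enumeration { val RECOVERING, ALIVE = Value }

case class AppInfo(id: String, address: String)

class ToyMaster {
  var state = RecoveryState.RECOVERING
  private val addressToApp = scala.collection.mutable.Map[String, AppInfo]()
  private val pendingRegistrations = scala.collection.mutable.ArrayBuffer[AppInfo]()

  def registerApplication(app: AppInfo): Unit = {
    if (state == RecoveryState.RECOVERING) {
      pendingRegistrations += app            // defer the duplicate-address check until recovery ends
    } else if (!addressToApp.contains(app.address)) {
      addressToApp(app.address) = app
    }
  }

  def completeRecovery(recovered: Seq[AppInfo]): Unit = {
    recovered.foreach(a => addressToApp(a.address) = a)
    state = RecoveryState.ALIVE
    // Stale recovered apps would be removed here; then replay the deferred registrations.
    pendingRegistrations.foreach(registerApplication)
    pendingRegistrations.clear()
  }
}
{code}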
[jira] [Assigned] (SPARK-15898) DataFrameReader.text should return DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15898: Assignee: Apache Spark (was: Wenchen Fan) > DataFrameReader.text should return DataFrame > > > Key: SPARK-15898 > URL: https://issues.apache.org/jira/browse/SPARK-15898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > See discussion at https://github.com/apache/spark/pull/13604 > We want to maintain API compatibility for DataFrameReader.text, and will > introduce a new API called DataFrameReader.textFile which returns > Dataset[String]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15898) DataFrameReader.text should return DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15898: Assignee: Wenchen Fan (was: Apache Spark) > DataFrameReader.text should return DataFrame > > > Key: SPARK-15898 > URL: https://issues.apache.org/jira/browse/SPARK-15898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan > > See discussion at https://github.com/apache/spark/pull/13604 > We want to maintain API compatibility for DataFrameReader.text, and will > introduce a new API called DataFrameReader.textFile which returns > Dataset[String]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15898) DataFrameReader.text should return DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326193#comment-15326193 ] Apache Spark commented on SPARK-15898: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13604 > DataFrameReader.text should return DataFrame > > > Key: SPARK-15898 > URL: https://issues.apache.org/jira/browse/SPARK-15898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Wenchen Fan > > See discussion at https://github.com/apache/spark/pull/13604 > We want to maintain API compatibility for DataFrameReader.text, and will > introduce a new API called DataFrameReader.textFile which returns > Dataset[String]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStream for setting MqttConnectOptions
[ https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15900: --- Summary: please add a map param on MQTTUtils.createStream for setting MqttConnectOptions (was: please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions ) > please add a map param on MQTTUtils.createStream for setting > MqttConnectOptions > > > Key: SPARK-15900 > URL: https://issues.apache.org/jira/browse/SPARK-15900 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.1 >Reporter: lichenglin > > I notice that MQTTReceiver creates a connection with the method > (org.eclipse.paho.client.mqttv3.MqttClient) client.connect(), > which is equivalent to client.connect(new MqttConnectOptions()); > This means we have to use the default MqttConnectOptions and can't set > other parameters such as username and password. > Please add a new method to MQTTUtils.createStream, like > createStream(jssc.ssc, brokerUrl, topic, map, storageLevel), > in order to build a non-default MqttConnectOptions. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
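[Editorial aside] A hedged sketch of what the requested map parameter could translate into on the receiver side. The helper name is hypothetical; the Paho setters used (setUserName, setPassword, setCleanSession) are standard MqttConnectOptions API.
{code}
import org.eclipse.paho.client.mqttv3.MqttConnectOptions

def buildConnectOptions(params: Map[String, String]): MqttConnectOptions = {
  val opts = new MqttConnectOptions()
  params.get("username").foreach(u => opts.setUserName(u))
  params.get("password").foreach(p => opts.setPassword(p.toCharArray))
  params.get("cleanSession").foreach(v => opts.setCleanSession(v.toBoolean))
  opts
}

// The receiver could then call client.connect(buildConnectOptions(map)) instead of
// client.connect(), which currently forces the default options.
{code}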
[jira] [Updated] (SPARK-15900) please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions
[ https://issues.apache.org/jira/browse/SPARK-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] lichenglin updated SPARK-15900: --- Summary: please add a map param on MQTTUtils.createStreamfor setting MqttConnectOptions (was: please add a map param on MQTTUtils.create for setting MqttConnectOptions ) > please add a map param on MQTTUtils.createStreamfor setting > MqttConnectOptions > --- > > Key: SPARK-15900 > URL: https://issues.apache.org/jira/browse/SPARK-15900 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.1 >Reporter: lichenglin > > I notice that MQTTReceiver creates a connection with the method > (org.eclipse.paho.client.mqttv3.MqttClient) client.connect(), > which is equivalent to client.connect(new MqttConnectOptions()); > This means we have to use the default MqttConnectOptions and can't set > other parameters such as username and password. > Please add a new method to MQTTUtils.createStream, like > createStream(jssc.ssc, brokerUrl, topic, map, storageLevel), > in order to build a non-default MqttConnectOptions. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15900) please add a map param on MQTTUtils.create for setting MqttConnectOptions
lichenglin created SPARK-15900: -- Summary: please add a map param on MQTTUtils.create for setting MqttConnectOptions Key: SPARK-15900 URL: https://issues.apache.org/jira/browse/SPARK-15900 Project: Spark Issue Type: New Feature Components: Streaming Affects Versions: 1.6.1 Reporter: lichenglin I notice that MQTTReceiver creates a connection with the method (org.eclipse.paho.client.mqttv3.MqttClient) client.connect(), which is equivalent to client.connect(new MqttConnectOptions()); This means we have to use the default MqttConnectOptions and can't set other parameters such as username and password. Please add a new method to MQTTUtils.createStream, like createStream(jssc.ssc, brokerUrl, topic, map, storageLevel), in order to build a non-default MqttConnectOptions. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15857) Add Caller Context in Spark
[ https://issues.apache.org/jira/browse/SPARK-15857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326153#comment-15326153 ] Sun Rui commented on SPARK-15857: - +1 for this feature > Add Caller Context in Spark > --- > > Key: SPARK-15857 > URL: https://issues.apache.org/jira/browse/SPARK-15857 > Project: Spark > Issue Type: New Feature >Reporter: Weiqing Yang > > Hadoop has implemented a log-tracing feature – caller context (JIRA: > HDFS-9184 and YARN-4349). The motivation is to better diagnose and understand > how specific applications impact parts of the Hadoop system and what potential > problems they may be creating (e.g. overloading the NN). As HDFS mentioned in > HDFS-9184, for a given HDFS operation, it's very helpful to track which upper-level > job issued it. The upper-level callers may be specific Oozie tasks, MR > jobs, Hive queries or Spark jobs. > Hadoop ecosystem projects like MapReduce, Tez (TEZ-2851), Hive (HIVE-12249, > HIVE-12254) and Pig (PIG-4714) have implemented their caller contexts. Those > systems invoke the HDFS client API and YARN client API to set up the caller context, > and also expose an API for passing a caller context into them. > Lots of Spark applications run on YARN/HDFS. Spark can also implement > its caller context by invoking the HDFS/YARN APIs, and also expose an API to its > upstream applications to set up their caller contexts. In the end, the Spark > caller context written into the YARN log / HDFS log can be associated with the task id, > stage id, job id and app id. That is also very helpful for Spark users to > identify tasks, especially if Spark supports multi-tenant environments in the > future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
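[Editorial aside] An illustrative sketch only, assuming Hadoop 2.8+ where HDFS-9184 added org.apache.hadoop.ipc.CallerContext. The context-string layout below is hypothetical; the ticket only proposes associating the context with app/job/stage/task ids.
{code}
import org.apache.hadoop.ipc.CallerContext

def setSparkCallerContext(appId: String, jobId: Int, stageId: Int, taskId: Long): Unit = {
  // Build a caller-context string carrying the Spark identifiers and install it for the
  // current thread, so downstream HDFS/YARN audit logs record which Spark work issued the call.
  val context = s"SPARK_${appId}_JId_${jobId}_SId_${stageId}_TId_$taskId"
  CallerContext.setCurrent(new CallerContext.Builder(context).build())
}
{code}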
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326146#comment-15326146 ] Sun Rui commented on SPARK-15799: - This has been requested before. The issue is that SparkR needs to work with a Spark distribution of the matching version. I think releasing SparkR on CRAN will promote its adoption, so we need to find a release model for it. My thoughts are as follows: 1. Release the R portion of SparkR as a SparkR package on CRAN, following the normal R package conventions. The package records the matching Spark version and a link to the Spark distribution. The package has an .onLoad() function. When it is loaded, .onLoad() will check whether a local Spark distribution is installed. If not, it will attempt to download the distribution from the link and save it into a proper location. The SparkR CRAN package depends on the Spark distribution for the RBackend, for local-mode execution and for remote cluster connections. .onLoad() will set SPARK_HOME if it finds the Spark distribution. 2. Add a version-check mechanism, so SparkR can check that it matches the remote cluster when remote cluster deploy mode is desired. 3. R users don't need special scripts like bin/sparkR or bin/spark-submit to use SparkR. They can just start R and load SparkR with library(), or run a SparkR script from the command line. In SparkR.init(), a version check is performed and, if there is a mismatch, an error message is displayed. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us from releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326127#comment-15326127 ] Weichen Xu commented on SPARK-15874: OK, I got it. But there is another problem: if I want to implement an HBase connector, I find it can't be optimized well under the current Spark SQL architecture. For example, the generated execution plan can't pass the proper rowkey to the underlying HBase table reader (if I want to implement the small-table-driven join optimization described above). So could the Spark SQL execution planner add support for such optimization techniques (which can take advantage of an underlying table with an index to speed up SQL execution)? > HBase rowkey optimization support for Hbase-Storage-handler > --- > > Key: SPARK-15874 > URL: https://issues.apache.org/jira/browse/SPARK-15874 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Weichen Xu > Original Estimate: 720h > Remaining Estimate: 720h > > Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` > for HBase table support, which is poorly optimized. For example, a query such > as > select * from hbase_tab1 where rowkey_col = 'abc'; > will cause a full table scan (each table region becomes a scan split and performs a > full region scan). > In fact, it is easy to implement the following optimizations: > 1. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc';` > or > `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or > ...;` > can use the HBase rowkey `Get`/`multiGet` API to execute efficiently. > 2. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc%';` > can use the HBase rowkey `Scan` API to execute efficiently. > Higher-level SQL optimizations will also benefit. For > example, given a very small table (such as incremental data) `small_tab1`, > SQL such as > `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = > hbase_tab1.rowkey_col` > can use the classic small-table-driven join optimization: > loop over each record of small_tab1, extract each small_tab1.key1 as > hbase_tab1's rowkey, and use the HBase Get API; the join will then execute efficiently. > This scenario is very common: many business systems have > several tables keyed by a major key such as userID, and they often store them > in HBase. People frequently need to do some analysis with > SQL, and these queries would be well optimized if the SQL execution plan had > good support for HBase rowkeys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
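[Editorial aside] For context on the pushdown question in the comment above: the existing Data Sources API already hands pushed-down predicates to an external relation via PrunedFilteredScan, which is where a rowkey EqualTo/In filter could be turned into an HBase Get/multi-Get instead of a full scan. The skeleton below is a sketch, not a working connector; the class name and single-column schema are made up for illustration.
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, In, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class HBaseRowkeyRelation(override val sqlContext: SQLContext) extends BaseRelation
  with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(StructField("rowkey_col", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Collect the rowkey predicates Spark SQL has pushed down to this relation.
    val rowkeys = filters.collect {
      case EqualTo("rowkey_col", v) => Seq(v.toString)
      case In("rowkey_col", vs)     => vs.map(_.toString).toSeq
    }.flatten
    // A real implementation would issue HBase Get/multi-Get calls for `rowkeys` here,
    // falling back to a range Scan when no rowkey predicate was pushed down.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}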
[jira] [Assigned] (SPARK-2623) Stacked Auto Encoder (Deep Learning )
[ https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-2623: --- Assignee: Apache Spark (was: Victor Fang) > Stacked Auto Encoder (Deep Learning ) > - > > Key: SPARK-2623 > URL: https://issues.apache.org/jira/browse/SPARK-2623 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Victor Fang >Assignee: Apache Spark > Labels: deeplearning, machine_learning > > We would like to add parallel implementation of Stacked Auto Encoder (Deep > Learning ) algorithm to Spark MLLib. > SAE is one of the most popular Deep Learning algorithms. It has achieved > successful benchmarks in MNIST hand written classifications, Google's > ICML2012 "cat face" paper (http://icml.cc/2012/papers/73.pdf), etc. > Our focus is to leverage the RDD and get the SAE with the following > capability with ease of use for both beginners and advanced researchers: > 1, multi layer SAE deep network training and scoring. > 2, unsupervised feature learning. > 3, supervised learning with multinomial logistic regression (softmax). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2623) Stacked Auto Encoder (Deep Learning )
[ https://issues.apache.org/jira/browse/SPARK-2623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326126#comment-15326126 ] Apache Spark commented on SPARK-2623: - User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/13621 > Stacked Auto Encoder (Deep Learning ) > - > > Key: SPARK-2623 > URL: https://issues.apache.org/jira/browse/SPARK-2623 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Victor Fang >Assignee: Victor Fang > Labels: deeplearning, machine_learning > > We would like to add parallel implementation of Stacked Auto Encoder (Deep > Learning ) algorithm to Spark MLLib. > SAE is one of the most popular Deep Learning algorithms. It has achieved > successful benchmarks in MNIST hand written classifications, Google's > ICML2012 "cat face" paper (http://icml.cc/2012/papers/73.pdf), etc. > Our focus is to leverage the RDD and get the SAE with the following > capability with ease of use for both beginners and advanced researchers: > 1, multi layer SAE deep network training and scoring. > 2, unsupervised feature learning. > 3, supervised learning with multinomial logistic regression (softmax). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326119#comment-15326119 ] Reynold Xin commented on SPARK-15874: - Got it. There are already multiple hbase connectors for Spark SQL outside the Spark project, and that's a good way to evolve the ecosystem. In practice there are a lot of users that are using various key value stores and we can't create built-in connectors for all of them. Definitely feel free to contribute to an existing one or create another one that's better and put it on https://spark-packages.org/. > HBase rowkey optimization support for Hbase-Storage-handler > --- > > Key: SPARK-15874 > URL: https://issues.apache.org/jira/browse/SPARK-15874 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Weichen Xu > Original Estimate: 720h > Remaining Estimate: 720h > > Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` > for HBase table support, which is poorly optimized. For example, a query such > as > select * from hbase_tab1 where rowkey_col = 'abc'; > will cause a full table scan (each table region becomes a scan split and performs a > full region scan). > In fact, it is easy to implement the following optimizations: > 1. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc';` > or > `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or > ...;` > can use the HBase rowkey `Get`/`multiGet` API to execute efficiently. > 2. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc%';` > can use the HBase rowkey `Scan` API to execute efficiently. > Higher-level SQL optimizations will also benefit. For > example, given a very small table (such as incremental data) `small_tab1`, > SQL such as > `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = > hbase_tab1.rowkey_col` > can use the classic small-table-driven join optimization: > loop over each record of small_tab1, extract each small_tab1.key1 as > hbase_tab1's rowkey, and use the HBase Get API; the join will then execute efficiently. > This scenario is very common: many business systems have > several tables keyed by a major key such as userID, and they often store them > in HBase. People frequently need to do some analysis with > SQL, and these queries would be well optimized if the SQL execution plan had > good support for HBase rowkeys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326114#comment-15326114 ] Weichen Xu commented on SPARK-15874: The HBase connector is implemented in Hive and Spark SQL can use it, for example by creating an HBase external table in Spark SQL: CREATE EXTERNAL TABLE hbase_tab1 ( rowkey string, f1 map<string,string>, f2 map<string,string>, f3 map<string,string> ) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:") TBLPROPERTIES ("hbase.table.name" = "tab1"); So I think Spark SQL could provide a more optimized HBase connector to replace the one implemented in Hive. > HBase rowkey optimization support for Hbase-Storage-handler > --- > > Key: SPARK-15874 > URL: https://issues.apache.org/jira/browse/SPARK-15874 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Weichen Xu > Original Estimate: 720h > Remaining Estimate: 720h > > Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` > for HBase table support, which is poorly optimized. For example, a query such > as > select * from hbase_tab1 where rowkey_col = 'abc'; > will cause a full table scan (each table region becomes a scan split and performs a > full region scan). > In fact, it is easy to implement the following optimizations: > 1. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc';` > or > `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or > ...;` > can use the HBase rowkey `Get`/`multiGet` API to execute efficiently. > 2. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc%';` > can use the HBase rowkey `Scan` API to execute efficiently. > Higher-level SQL optimizations will also benefit. For > example, given a very small table (such as incremental data) `small_tab1`, > SQL such as > `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = > hbase_tab1.rowkey_col` > can use the classic small-table-driven join optimization: > loop over each record of small_tab1, extract each small_tab1.key1 as > hbase_tab1's rowkey, and use the HBase Get API; the join will then execute efficiently. > This scenario is very common: many business systems have > several tables keyed by a major key such as userID, and they often store them > in HBase. People frequently need to do some analysis with > SQL, and these queries would be well optimized if the SQL execution plan had > good support for HBase rowkeys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326101#comment-15326101 ] Matthias Boehm commented on SPARK-15882: I really like this direction and think it has the potential to become a higher level API for Spark ML, as data frames and data sets have become for Spark SQL. If there is interest, we'd like to help contributing to this feature by porting over a subset of distributed linear algebra operations from SystemML. General Goals: From my perspective, we should aim for an API that hides the underlying data representation (e.g., RDD/Dataset, sparse/dense, blocking configurations, block/row/coordinate, partitioning etc). Furthermore, it would be great to make it easy to swap out the used local matrix library. This approach would allow people to plug in their custom operations (e.g., native BLAS libraries/kernels or compressed block operations), while still relying on a common API and scheme for distributing blocks. RDDs over Datasets: For the internal implementation, I would favor RDDs over Datasets because (1) RDDs allow for more flexibility (e.g., reduceByKey, combineByKey, partitioning-preserving operations), and (2) encoders don't offer much benefit for blocked representations as the per-block overhead is typically negligible. Basic Operations: Initially, I would start with a small well-defined set of operations including matrix multiplications, unary and binary operations (e.g., arithmetic/comparison), unary aggregates (e.g., sum/rowSums/colSums, min/max/mean/sd), reorg operations (transpose/diag/reshape/order), and cumulative aggregates (e.g., cumsum). Towards Optimization: Internally, we could implement alternative operations but hide them under a common interface. For example, matrix multiplication would be exposed as 'multiply' (consistent with local linalg) - internally, however, we would select between alternative operations (see https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt), based on a simple rule set or user-provided hints as done in Spark SQL. Later, we could think about a more sophisticated optimizer, potentially relying on the existing catalyst infrastructure. What do you think? > Discuss distributed linear algebra in spark.ml package > -- > > Key: SPARK-15882 > URL: https://issues.apache.org/jira/browse/SPARK-15882 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* > should be migrated to org.apache.spark.ml. > Initial questions: > * Should we use Datasets or RDDs underneath? > * If Datasets, are there missing features needed for the migration? > * Do we want to redesign any aspects of the distributed matrices during this > move? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
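[Editorial aside] For reference, a minimal sketch of the existing RDD-backed BlockMatrix in spark.mllib, which is the kind of blocked distributed representation the migration/optimizer discussion above would build on. Block contents and dimensions are made up for illustration; sc is the shell's SparkContext.
{code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// A 2x4 matrix stored as two 2x2 blocks keyed by (blockRowIndex, blockColIndex).
val blocks = sc.parallelize(Seq(
  ((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))),
  ((0, 1), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))
))
val a = new BlockMatrix(blocks, 2, 2)      // 2 rows and 2 columns per block
val product = a.multiply(a.transpose)      // distributed block-wise multiplication
val local = product.toLocalMatrix()        // small 2x2 result collected for inspection
{code}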
[jira] [Created] (SPARK-15899) file scheme should be used correctly
Kazuaki Ishizaki created SPARK-15899: Summary: file scheme should be used correctly Key: SPARK-15899 URL: https://issues.apache.org/jira/browse/SPARK-15899 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Kazuaki Ishizaki [RFC 1738|https://www.ietf.org/rfc/rfc1738.txt] defines the file scheme as {{file://host/}} or {{file:///}} (see also [Wikipedia|https://en.wikipedia.org/wiki/File_URI_scheme]). [Some code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L58] uses a different prefix such as {{file:}}. It would be good to prepare a utility method that correctly adds the {{file://host}} or {{file:///}} prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
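[Editorial aside] One possible shape for such a helper (a sketch, not the proposed utility itself), using only java.io.File and java.net.URI; paths that already carry a scheme are left untouched, bare local paths get a proper file: URI. Exact rendering (file:/ vs file:///) depends on the JDK's URI formatting.
{code}
import java.io.File
import java.net.URI

def toFileUri(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri              // already has a scheme (file:, hdfs:, s3a:, ...)
  else new File(path).getAbsoluteFile.toURI   // produces a file: URI for the absolute local path
}

// toFileUri("/tmp/warehouse")      -> a file: URI for /tmp/warehouse
// toFileUri("hdfs://nn:8020/tmp")  -> unchanged
{code}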
[jira] [Commented] (SPARK-15874) HBase rowkey optimization support for Hbase-Storage-handler
[ https://issues.apache.org/jira/browse/SPARK-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326069#comment-15326069 ] Reynold Xin commented on SPARK-15874: - I'm confused -- Apache Spark's code base itself does not include an HBase connector. Which one are you referring to? > HBase rowkey optimization support for Hbase-Storage-handler > --- > > Key: SPARK-15874 > URL: https://issues.apache.org/jira/browse/SPARK-15874 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Weichen Xu > Original Estimate: 720h > Remaining Estimate: 720h > > Currently, Spark SQL uses `org.apache.hadoop.hive.hbase.HBaseStorageHandler` > for HBase table support, which is poorly optimized. For example, a query such > as > select * from hbase_tab1 where rowkey_col = 'abc'; > will cause a full table scan (each table region becomes a scan split and performs a > full region scan). > In fact, it is easy to implement the following optimizations: > 1. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc';` > or > `select * from hbase_tab1 where rowkey_col = 'abc' or rowkey_col = 'abd' or > ...;` > can use the HBase rowkey `Get`/`multiGet` API to execute efficiently. > 2. > SQL such as > `select * from hbase_tab1 where rowkey_col = 'abc%';` > can use the HBase rowkey `Scan` API to execute efficiently. > Higher-level SQL optimizations will also benefit. For > example, given a very small table (such as incremental data) `small_tab1`, > SQL such as > `select * from small_tab1 join hbase_tab1 on small_tab1.key1 = > hbase_tab1.rowkey_col` > can use the classic small-table-driven join optimization: > loop over each record of small_tab1, extract each small_tab1.key1 as > hbase_tab1's rowkey, and use the HBase Get API; the join will then execute efficiently. > This scenario is very common: many business systems have > several tables keyed by a major key such as userID, and they often store them > in HBase. People frequently need to do some analysis with > SQL, and these queries would be well optimized if the SQL execution plan had > good support for HBase rowkeys. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15807) Support varargs for dropDuplicates in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15807. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Support varargs for dropDuplicates in Dataset/DataFrame > --- > > Key: SPARK-15807 > URL: https://issues.apache.org/jira/browse/SPARK-15807 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue adds `varargs`-types `dropDuplicates` functions in > `Dataset/DataFrame`. Currently, `dropDuplicates` supports only `Seq` or > `Array`. > {code} > scala> val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2))) > ds: org.apache.spark.sql.DataFrame = [_1: string, _2: int] > scala> ds.dropDuplicates(Seq("_1", "_2")) > res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: string, > _2: int] > scala> ds.dropDuplicates("_1", "_2") > :26: error: overloaded method value dropDuplicates with alternatives: > (colNames: > Array[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > (colNames: > Seq[String])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > ()org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] > cannot be applied to (String, String) >ds.dropDuplicates("_1", "_2") > ^ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
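[Editorial aside] A quick usage sketch in the shell, reusing the example data from the description (the Seq form already works today; the varargs form is what this issue adds):
{code}
val ds = spark.createDataFrame(Seq(("a", 1), ("b", 2), ("a", 2)))
ds.dropDuplicates(Seq("_1", "_2")).show()   // works before and after this change
ds.dropDuplicates("_1", "_2").show()        // compiles only once the varargs overload is added
{code}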
[jira] [Resolved] (SPARK-14851) Support radix sort with nullable longs
[ https://issues.apache.org/jira/browse/SPARK-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14851. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.1.0 > Support radix sort with nullable longs > -- > > Key: SPARK-14851 > URL: https://issues.apache.org/jira/browse/SPARK-14851 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.1.0 > > > The current radix sort cannot handle nullable longs, since there is no bit > pattern available to represents nulls. These cases are probably best handled > outside the radix sort logic e.g. by keeping nulls in a separate region of > the array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
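[Editorial aside] A simplified illustration of the "separate region" idea (not the actual unsafe-sorter code): move null keys to one end of the array and sort only the non-null region, so the radix sort never has to reserve a bit pattern for null. The standard library sort stands in for the radix sort here.
{code}
def sortNullableLongs(values: Array[java.lang.Long], nullsFirst: Boolean = true): Array[java.lang.Long] = {
  val (nulls, nonNulls) = values.partition(_ == null)             // carve out the null region
  val sorted = nonNulls.map(_.longValue).sorted.map(java.lang.Long.valueOf) // stand-in for radix sort
  if (nullsFirst) nulls ++ sorted else sorted ++ nulls
}

// sortNullableLongs(Array[java.lang.Long](3L, null, 1L)) yields Array(null, 1, 3)
{code}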
[jira] [Created] (SPARK-15898) DataFrameReader.text should return DataFrame
Reynold Xin created SPARK-15898: --- Summary: DataFrameReader.text should return DataFrame Key: SPARK-15898 URL: https://issues.apache.org/jira/browse/SPARK-15898 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan See discussion at https://github.com/apache/spark/pull/13604 We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
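[Editorial aside] The usage contrast this change settles on, in the 2.0 shell (the path is illustrative):
{code}
val df = spark.read.text("/path/to/logs")       // DataFrame with a single "value" column,
                                                // so partitioned columns can still surface
val ds = spark.read.textFile("/path/to/logs")   // Dataset[String], the new convenience API
{code}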
[jira] [Commented] (SPARK-15856) Revert API breaking changes made in SQLContext.range
[ https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326065#comment-15326065 ] Reynold Xin commented on SPARK-15856: - Note that we have decided to only revert the SQLContext.range API in this ticket. > Revert API breaking changes made in SQLContext.range > > > Key: SPARK-15856 > URL: https://issues.apache.org/jira/browse/SPARK-15856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Lian >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > In Spark 2.0, after unifying Datasets and DataFrames, we made two API > breaking changes: > # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of > {{DataFrame}} > # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of > {{DataFrame}} > However, these two changes introduced several inconsistencies and problems: > # {{spark.read.text()}} silently discards partitioned columns when reading a > partitioned table in text format since {{Dataset\[String\]}} only contains a > single field. Users have to use {{spark.read.format("text").load()}} to > workaround this, which is pretty confusing and error-prone. > # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} > (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}. > # When applying typed operations over Datasets returned by {{spark.range()}}, > weird schema changes may happen. Please refer to SPARK-15632 for more details. > Due to these reasons, we decided to revert these two changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15856) Revert API breaking changes made in SQLContext.range
[ https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15856: Summary: Revert API breaking changes made in SQLContext.range (was: Revert API breaking changes made in DataFrameReader.text and SQLContext.range) > Revert API breaking changes made in SQLContext.range > > > Key: SPARK-15856 > URL: https://issues.apache.org/jira/browse/SPARK-15856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Lian >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > In Spark 2.0, after unifying Datasets and DataFrames, we made two API > breaking changes: > # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of > {{DataFrame}} > # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of > {{DataFrame}} > However, these two changes introduced several inconsistencies and problems: > # {{spark.read.text()}} silently discards partitioned columns when reading a > partitioned table in text format since {{Dataset\[String\]}} only contains a > single field. Users have to use {{spark.read.format("text").load()}} to > workaround this, which is pretty confusing and error-prone. > # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} > (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}. > # When applying typed operations over Datasets returned by {{spark.range()}}, > weird schema changes may happen. Please refer to SPARK-15632 for more details. > Due to these reasons, we decided to revert these two changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15856) Revert API breaking changes made in SQLContext.range
[ https://issues.apache.org/jira/browse/SPARK-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15856. - Resolution: Fixed Fix Version/s: 2.0.0 > Revert API breaking changes made in SQLContext.range > > > Key: SPARK-15856 > URL: https://issues.apache.org/jira/browse/SPARK-15856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Lian >Assignee: Wenchen Fan > Fix For: 2.0.0 > > > In Spark 2.0, after unifying Datasets and DataFrames, we made two API > breaking changes: > # {{DataFrameReader.text()}} now returns {{Dataset\[String\]}} instead of > {{DataFrame}} > # {{SQLContext.range()}} now returns {{Dataset\[java.lang.Long\]}} instead of > {{DataFrame}} > However, these two changes introduced several inconsistencies and problems: > # {{spark.read.text()}} silently discards partitioned columns when reading a > partitioned table in text format since {{Dataset\[String\]}} only contains a > single field. Users have to use {{spark.read.format("text").load()}} to > workaround this, which is pretty confusing and error-prone. > # All data source shortcut methods in `DataFrameReader` return {{DataFrame}} > (aka {{Dataset\[Row\]}}) except for {{DataFrameReader.text()}}. > # When applying typed operations over Datasets returned by {{spark.range()}}, > weird schema changes may happen. Please refer to SPARK-15632 for more details. > Due to these reasons, we decided to revert these two changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15881) Update microbenchmark results
[ https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15881. - Resolution: Fixed Assignee: Eric Liang Fix Version/s: 2.0.0 > Update microbenchmark results > - > > Key: SPARK-15881 > URL: https://issues.apache.org/jira/browse/SPARK-15881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang >Assignee: Eric Liang > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15585) Don't use null in data source options to indicate default value
[ https://issues.apache.org/jira/browse/SPARK-15585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15585. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 > Don't use null in data source options to indicate default value > --- > > Key: SPARK-15585 > URL: https://issues.apache.org/jira/browse/SPARK-15585 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Takeshi Yamamuro >Priority: Critical > Fix For: 2.0.0 > > > See email: > http://apache-spark-developers-list.1001551.n3.nabble.com/changed-behavior-for-csv-datasource-and-quoting-in-spark-2-0-0-SNAPSHOT-td17704.html > We'd need to change DataFrameReader/DataFrameWriter in Python's > csv/json/parquet/... functions to put the actual default option values as > function parameters, rather than setting them to None. We can then in > CSVOptions.getChar (and JSONOptions, etc) to actually return null if the > value is null, rather than setting it to default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
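[Editorial aside] A generic sketch of the idea only (the helper name is hypothetical, not a Spark internal): resolve an option's documented default when the key is absent, instead of letting a null value stand in for "use the default".
{code}
def getChar(options: Map[String, String], key: String, default: Char): Char =
  options.get(key) match {
    case None                     => default            // option not supplied: use the documented default
    case Some(v) if v.length == 1 => v.charAt(0)         // explicit single-character value
    case Some(v) =>
      throw new IllegalArgumentException(s"$key must be a single character, got '$v'")
  }

// getChar(Map("quote" -> "'"), "quote", '"') == '\''
// getChar(Map.empty,           "quote", '"') == '"'
{code}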
[jira] [Updated] (SPARK-15881) Update microbenchmark results
[ https://issues.apache.org/jira/browse/SPARK-15881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15881: -- Fix Version/s: (was: 2.0.0) > Update microbenchmark results > - > > Key: SPARK-15881 > URL: https://issues.apache.org/jira/browse/SPARK-15881 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader
[ https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15639: -- Fix Version/s: (was: 2.0.0) > Try to push down filter at RowGroups level for parquet reader > - > > Key: SPARK-15639 > URL: https://issues.apache.org/jira/browse/SPARK-15639 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > > When we use vecterized parquet reader, although the base reader (i.e., > SpecificParquetRecordReaderBase) will retrieve pushed-down filters for > RowGroups-level filtering, we seems not really set up the filters to be > pushed down. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12661) Drop Python 2.6 support in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326036#comment-15326036 ] Josh Rosen commented on SPARK-12661: Yeah, I think that just messaging that Python 2.6 users should aim to upgrade to 2.7+ before Spark 2.1.0 will be sufficient (and maybe print a deprecation warning if we detect that we're running on 2.6). > Drop Python 2.6 support in PySpark > -- > > Key: SPARK-12661 > URL: https://issues.apache.org/jira/browse/SPARK-12661 > Project: Spark > Issue Type: Task > Components: PySpark >Reporter: Davies Liu > Labels: releasenotes > > 1. stop testing with 2.6 > 2. remove the code for python 2.6 > see discussion : > https://www.mail-archive.com/user@spark.apache.org/msg43423.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15892: -- Assignee: Hyukjin Kwon > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15892: -- Summary: Incorrectly merged AFTAggregator with zero total count (was: aft_survival_regression.py example fails in branch-2.0) > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15892: -- Shepherd: Joseph K. Bradley > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15892: -- Affects Version/s: 1.6.1 > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
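The summary above ("incorrectly merged ... with zero total count") suggests the merge path does not guard against a partner aggregator that saw no instances. A schematic sketch of that guard, using a hypothetical stand-in class rather than Spark's actual AFTAggregator:
{code}
// Hypothetical stand-in, not Spark's AFTAggregator: merging in an aggregator that
// saw zero instances (e.g. one coming from an empty partition) should be a no-op.
class CountingAggregator {
  var count: Long = 0L
  var lossSum: Double = 0.0

  def merge(other: CountingAggregator): this.type = {
    if (other.count != 0L) {   // skip aggregators from empty partitions
      count += other.count
      lossSum += other.lossSum
    }
    this
  }
}
{code}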
[jira] [Assigned] (SPARK-15590) Paginate Job Table in Jobs tab
[ https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15590: Assignee: Apache Spark (was: Tao Lin) > Paginate Job Table in Jobs tab > -- > > Key: SPARK-15590 > URL: https://issues.apache.org/jira/browse/SPARK-15590 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15590) Paginate Job Table in Jobs tab
[ https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15590: Assignee: Tao Lin (was: Apache Spark) > Paginate Job Table in Jobs tab > -- > > Key: SPARK-15590 > URL: https://issues.apache.org/jira/browse/SPARK-15590 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Yin Huai >Assignee: Tao Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15590) Paginate Job Table in Jobs tab
[ https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325971#comment-15325971 ] Apache Spark commented on SPARK-15590: -- User 'nblintao' has created a pull request for this issue: https://github.com/apache/spark/pull/13620 > Paginate Job Table in Jobs tab > -- > > Key: SPARK-15590 > URL: https://issues.apache.org/jira/browse/SPARK-15590 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Yin Huai >Assignee: Tao Lin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15892: Assignee: Apache Spark > aft_survival_regression.py example fails in branch-2.0 > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325955#comment-15325955 ] Apache Spark commented on SPARK-15892: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/13619 > aft_survival_regression.py example fails in branch-2.0 > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15892) aft_survival_regression.py example fails in branch-2.0
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15892: Assignee: (was: Apache Spark) > aft_survival_regression.py example fails in branch-2.0 > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15883) Fix broken links on MLLIB documentations
[ https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15883. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13608 [https://github.com/apache/spark/pull/13608] > Fix broken links on MLLIB documentations > > > Key: SPARK-15883 > URL: https://issues.apache.org/jira/browse/SPARK-15883 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Dongjoon Hyun >Priority: Trivial > Fix For: 2.0.0 > > > This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, > this contains some editorial change. > **Fix broken links** > * mllib-data-types.md > * mllib-decision-tree.md > * mllib-ensembles.md > * mllib-feature-extraction.md > * mllib-pmml-model-export.md > * mllib-statistics.md > **Fix malformed section header and scala coding style** > * mllib-linear-methods.md > **Replace indirect forward links with direct one** > * ml-classification-regression.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15883) Fix broken links on MLLIB documentations
[ https://issues.apache.org/jira/browse/SPARK-15883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15883: -- Assignee: Dongjoon Hyun > Fix broken links on MLLIB documentations > > > Key: SPARK-15883 > URL: https://issues.apache.org/jira/browse/SPARK-15883 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Trivial > Fix For: 2.0.0 > > > This issue fixes all broken links on Spark 2.0 preview MLLib documents. Also, > this contains some editorial change. > **Fix broken links** > * mllib-data-types.md > * mllib-decision-tree.md > * mllib-ensembles.md > * mllib-feature-extraction.md > * mllib-pmml-model-export.md > * mllib-statistics.md > **Fix malformed section header and scala coding style** > * mllib-linear-methods.md > **Replace indirect forward links with direct one** > * ml-classification-regression.md -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15796: Assignee: Apache Spark > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Assignee: Apache Spark >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. 
This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper limit but acts more like a lower > bound on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to support heavy caching with default parameters and > without GC breakdown? If so, then better default values are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
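For reference, the workaround flags quoted in the report can also be set programmatically; a minimal sketch using only the values reported above to help:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Cap the unified memory region below the old generation size (with OpenJDK's
// default NewRatio=2 the old gen is roughly 2/3 of the heap).
val conf = new SparkConf()
  .setAppName("Cache Demo Application")
  .set("spark.memory.fraction", "0.6")
  // .set("spark.memory.useLegacyMode", "true")  // alternative workaround from the report
val sc = new SparkContext(conf)
{code}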
[jira] [Commented] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325833#comment-15325833 ] Apache Spark commented on SPARK-15796: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/13618 > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. 
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper limit but acts more like a lower > bound on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to support heavy caching with default parameters and > without GC breakdown? If so, then better default values are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15796) Reduce spark.memory.fraction default to avoid overrunning old gen in JVM default config
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15796: Assignee: (was: Apache Spark) > Reduce spark.memory.fraction default to avoid overrunning old gen in JVM > default config > --- > > Key: SPARK-15796 > URL: https://issues.apache.org/jira/browse/SPARK-15796 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gabor Feher >Priority: Minor > > While debugging performance issues in a Spark program, I've found a simple > way to slow down Spark 1.6 significantly by filling the RDD memory cache. > This seems to be a regression, because setting > "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is > just a simple program that fills the memory cache of Spark using a > MEMORY_ONLY cached RDD (but of course this comes up in more complex > situations, too): > {code} > import org.apache.spark.SparkContext > import org.apache.spark.SparkConf > import org.apache.spark.storage.StorageLevel > object CacheDemoApp { > def main(args: Array[String]) { > val conf = new SparkConf().setAppName("Cache Demo Application") > > val sc = new SparkContext(conf) > val startTime = System.currentTimeMillis() > > > val cacheFiller = sc.parallelize(1 to 5, 1000) > > .mapPartitionsWithIndex { > case (ix, it) => > println(s"CREATE DATA PARTITION ${ix}") > > val r = new scala.util.Random(ix) > it.map(x => (r.nextLong, r.nextLong)) > } > cacheFiller.persist(StorageLevel.MEMORY_ONLY) > cacheFiller.foreach(identity) > val finishTime = System.currentTimeMillis() > val elapsedTime = (finishTime - startTime) / 1000 > println(s"TIME= $elapsedTime s") > } > } > {code} > If I call it the following way, it completes in around 5 minutes on my > Laptop, while often stopping for slow Full GC cycles. I can also see with > jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled. > {code} > sbt package > ~/spark-1.6.0/bin/spark-submit \ > --class "CacheDemoApp" \ > --master "local[2]" \ > --driver-memory 3g \ > --driver-java-options "-XX:+PrintGCDetails" \ > target/scala-2.10/simple-project_2.10-1.0.jar > {code} > If I add any one of the below flags, then the run-time drops to around 40-50 > seconds and the difference is coming from the drop in GC times: > --conf "spark.memory.fraction=0.6" > OR > --conf "spark.memory.useLegacyMode=true" > OR > --driver-java-options "-XX:NewRatio=3" > All the other cache types except for DISK_ONLY produce similar symptoms. It > looks like that the problem is that the amount of data Spark wants to store > long-term ends up being larger than the old generation size in the JVM and > this triggers Full GC repeatedly. > I did some research: > * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It > defaults to 0.75. > * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache > size. It defaults to 0.6 and... > * http://spark.apache.org/docs/1.5.2/configuration.html even says that it > shouldn't be bigger than the size of the old generation. > * On the other hand, OpenJDK's default NewRatio is 2, which means an old > generation size of 66%. Hence the default value in Spark 1.6 contradicts this > advice. > http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old > generation is running close to full, then setting > spark.memory.storageFraction to a lower value should help. I have tried with > spark.memory.storageFraction=0.1, but it still doesn't fix the issue. 
This is > not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html > explains that storageFraction is not an upper limit but acts more like a lower > bound on the size of Spark's cache. The real upper limit is > spark.memory.fraction. > To sum up my questions/issues: > * At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed. > Maybe the old generation size should also be mentioned in configuration.html > near spark.memory.fraction. > * Is it a goal for Spark to support heavy caching with default parameters and > without GC breakdown? If so, then better default values are needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15879) Update logo in UI and docs to add "Apache"
[ https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-15879: -- Assignee: Sean Owen Priority: Minor (was: Major) > Update logo in UI and docs to add "Apache" > -- > > Key: SPARK-15879 > URL: https://issues.apache.org/jira/browse/SPARK-15879 > Project: Spark > Issue Type: Task > Components: Documentation, Web UI >Reporter: Matei Zaharia >Assignee: Sean Owen >Priority: Minor > Fix For: 2.0.0 > > > We recently added "Apache" to the Spark logo on the website > (http://spark.apache.org/images/spark-logo.eps) to have it be the full > project name, and we should do the same in the web UI and docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15879) Update logo in UI and docs to add "Apache"
[ https://issues.apache.org/jira/browse/SPARK-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15879. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13609 [https://github.com/apache/spark/pull/13609] > Update logo in UI and docs to add "Apache" > -- > > Key: SPARK-15879 > URL: https://issues.apache.org/jira/browse/SPARK-15879 > Project: Spark > Issue Type: Task > Components: Documentation, Web UI >Reporter: Matei Zaharia > Fix For: 2.0.0 > > > We recently added "Apache" to the Spark logo on the website > (http://spark.apache.org/images/spark-logo.eps) to have it be the full > project name, and we should do the same in the web UI and docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15897) Function Registry should just take in FunctionIdentifier for type safety and avoid duplicating
[ https://issues.apache.org/jira/browse/SPARK-15897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325826#comment-15325826 ] Sandeep Singh commented on SPARK-15897: --- I'm working on this, will create a PR soon. > Function Registry should just take in FunctionIdentifier for type safety and > avoid duplicating > -- > > Key: SPARK-15897 > URL: https://issues.apache.org/jira/browse/SPARK-15897 > Project: Spark > Issue Type: Improvement >Reporter: Sandeep Singh >Priority: Minor > Labels: sql > > Function Registry should just take in FunctionIdentifier for type safety and > avoid duplicating > TODOs were added in these PRs > https://github.com/apache/spark/pull/12713 > (https://github.com/apache/spark/pull/12713/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R619) > https://github.com/apache/spark/pull/12198 > (https://github.com/apache/spark/pull/12198/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R457) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15897) Function Registry should just take in FunctionIdentifier for type safety and avoid duplicating
Sandeep Singh created SPARK-15897: - Summary: Function Registry should just take in FunctionIdentifier for type safety and avoid duplicating Key: SPARK-15897 URL: https://issues.apache.org/jira/browse/SPARK-15897 Project: Spark Issue Type: Improvement Reporter: Sandeep Singh Priority: Minor Function Registry should just take in FunctionIdentifier for type safety and avoid duplicating TODOs were added in these PRs https://github.com/apache/spark/pull/12713 (https://github.com/apache/spark/pull/12713/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R619) https://github.com/apache/spark/pull/12198 (https://github.com/apache/spark/pull/12198/files#diff-b3f9800839b9b9a1df9da9cbfc01adf8R457) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
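A purely hypothetical sketch, not Spark's actual FunctionRegistry API, of what an interface keyed by FunctionIdentifier rather than by raw strings could look like:
{code}
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionInfo}

// Hypothetical trait for illustration only: every entry point takes a
// FunctionIdentifier, so database-qualified lookups are type safe and the
// name-handling logic is not duplicated per method.
trait IdentifierKeyedFunctionRegistry {
  def registerFunction(
      name: FunctionIdentifier,
      info: ExpressionInfo,
      builder: Seq[Expression] => Expression): Unit

  def lookupFunction(name: FunctionIdentifier, children: Seq[Expression]): Expression

  def functionExists(name: FunctionIdentifier): Boolean
}
{code}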
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325761#comment-15325761 ] Pete Robbins commented on SPARK-15822: -- has failed on latest Branch 2.0 and master. Currently using Branch-2.0 commit a790ac5793e1988895341fa878f947b09b275926 Author: yinxusen Date: Wed Jun 8 09:18:04 2016 +0100 > segmentation violation in o.a.s.unsafe.types.UTF8String > > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Assignee: Herman van Hovell >Priority: Blocker > > Executors fail with segmentation violation while running application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > Also now reproduced with > spark.memory.offHeap.enabled false > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this on IBM java on PowerPC box but is recreatable on linux > with OpenJDK. On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > 
at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15822) segmentation violation in o.a.s.unsafe.types.UTF8String
[ https://issues.apache.org/jira/browse/SPARK-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15325759#comment-15325759 ] Pete Robbins commented on SPARK-15822: -- The stack trace is taken earlier when I detect that the UTF8String created from the corrupt UnsafeRow is created as I'm trying to backtrack to the point of corruption. The earlier stacktrace is the npe which occurs later on trying to use the corrupt UTF8String. Dumb question but how do I post the plan? > segmentation violation in o.a.s.unsafe.types.UTF8String > > > Key: SPARK-15822 > URL: https://issues.apache.org/jira/browse/SPARK-15822 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: linux amd64 > openjdk version "1.8.0_91" > OpenJDK Runtime Environment (build 1.8.0_91-b14) > OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode) >Reporter: Pete Robbins >Assignee: Herman van Hovell >Priority: Blocker > > Executors fail with segmentation violation while running application with > spark.memory.offHeap.enabled true > spark.memory.offHeap.size 512m > Also now reproduced with > spark.memory.offHeap.enabled false > {noformat} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x7f4559b4d4bd, pid=14182, tid=139935319750400 > # > # JRE version: OpenJDK Runtime Environment (8.0_91-b14) (build 1.8.0_91-b14) > # Java VM: OpenJDK 64-Bit Server VM (25.91-b14 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # J 4816 C2 > org.apache.spark.unsafe.types.UTF8String.compareTo(Lorg/apache/spark/unsafe/types/UTF8String;)I > (64 bytes) @ 0x7f4559b4d4bd [0x7f4559b4d460+0x5d] > {noformat} > We initially saw this on IBM java on PowerPC box but is recreatable on linux > with OpenJDK. 
On linux with IBM Java 8 we see a null pointer exception at the > same code point: > {noformat} > 16/06/08 11:14:58 ERROR Executor: Exception in task 1.0 in stage 5.0 (TID 48) > java.lang.NullPointerException > at > org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:831) > at org.apache.spark.unsafe.types.UTF8String.compare(UTF8String.java:844) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.findNextInnerJoinRows$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$doExecute$2$$anon$2.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:30) > at org.spark_project.guava.collect.Ordering.leastOf(Ordering.java:664) > at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1365) > at > org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1$$anonfun$30.apply(RDD.scala:1362) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:757) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
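On the question above of how to post the plan: a minimal sketch (the query here is only a placeholder standing in for the failing join) that prints the plans so they can be pasted into the JIRA:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plan-dump").master("local[2]").getOrCreate()
// Placeholder query; substitute the Dataset/DataFrame from the failing application.
val df = spark.range(1000).join(spark.range(1000), "id")
df.explain(true)            // prints parsed, analyzed, optimized and physical plans
println(df.queryExecution)  // the same plans as a single string
{code}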