[jira] [Created] (SPARK-13671) Use different physical plan for existing RDD and data sources
Davies Liu created SPARK-13671: -- Summary: Use different physical plan for existing RDD and data sources Key: SPARK-13671 URL: https://issues.apache.org/jira/browse/SPARK-13671 Project: Spark Issue Type: Task Components: SQL Reporter: Davies Liu Assignee: Davies Liu Right now, we use PhysicalRDD for both existing RDDs and data sources. These two cases are becoming quite different, so we should use different physical plans for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10610) Using AppName instead of AppId in the name of all metrics
[ https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179507#comment-15179507 ] Pete Robbins edited comment on SPARK-10610 at 3/4/16 8:01 AM: -- I think the appId is an important piece of information when visualizing the metrics along with hostname, executorId etc. I'm writing a sink and reporter to push the metrics to Elasticsearch and I include these in the metrics types for better correlation. e.g. { "timestamp": "2016-03-03T15:58:31.903+", "hostName": "9.20.187.127", "applicationId": "app-20160303155742-0005", "executorId": "driver", "BlockManager_memory_maxMem_MB": 3933 } The appId and executorId I extract from the metric name. When the sink is instantiated I don't believe I have access to any Utils to obtain the current appId and executorId, so I'm kind of relying on these being in the metric name for the moment. Is it possible to make appId, applicationName, executorId available to me via some Utils function that I have access to in a metrics Sink? I guess I'm asking: how can I get hold of the SparkConf if I've not been passed it? was (Author: robbinspg): I think the appId is an important piece of information when visualizing the metrics along with hostname, executorId etc. I'm writing a sink and reporter to push the metrics to Elasticsearch and I include these in the metrics types for better correlation. e.g. { "timestamp": "2016-03-03T15:58:31.903+", "hostName": "9.20.187.127", "applicationId": "app-20160303155742-0005", "executorId": "driver", "BlockManager_memory_maxMem_MB": 3933 } The appId and executorId I extract from the metric name. When the sink is instantiated I don't believe I have access to any Utils to obtain the current appId and executorId, so I'm kind of relying on these being in the metric name for the moment. Is it possible to make appId, applicationName, executorId available to me via some Utils function that I have access to in a metrics Sink? > Using AppName instead of AppId in the name of all metrics > - > > Key: SPARK-10610 > URL: https://issues.apache.org/jira/browse/SPARK-10610 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Yi Tian >Priority: Minor > > When we use {{JMX}} to monitor the Spark system, we have to configure the names > of the target metrics in the monitoring system. But the current metric name is > {{appId}} + {{executorId}} + {{source}}. So when the Spark program is > restarted, we have to update the metric names in the monitoring system. > We should add an optional configuration property to control whether to use the > appName instead of the appId in the Spark metrics system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
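For illustration, a minimal Scala sketch (not an existing Spark or sink API) of the parsing described above, assuming metric names follow the default "appId.executorId.source.metric" layout; names that don't match simply yield None.

{code:title=Sketch (Scala)|borderStyle=solid}
// Split a metric name such as
// "app-20160303155742-0005.driver.BlockManager.memory.maxMem_MB"
// into (appId, executorId, remaining metric path).
def splitMetricName(name: String): Option[(String, String, String)] = {
  name.split("\\.", 3) match {
    case Array(appId, executorId, rest) => Some((appId, executorId, rest))
    case _ => None
  }
}

// splitMetricName("app-20160303155742-0005.driver.BlockManager.memory.maxMem_MB")
// == Some(("app-20160303155742-0005", "driver", "BlockManager.memory.maxMem_MB"))
{code}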
[jira] [Commented] (SPARK-13671) Use different physical plan for existing RDD and data sources
[ https://issues.apache.org/jira/browse/SPARK-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179538#comment-15179538 ] Apache Spark commented on SPARK-13671: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11514 > Use different physical plan for existing RDD and data sources > - > > Key: SPARK-13671 > URL: https://issues.apache.org/jira/browse/SPARK-13671 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we use PhysicalRDD for both existing RDD and data sources, they > are becoming much different, we should use different physical plans for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13671) Use different physical plan for existing RDD and data sources
[ https://issues.apache.org/jira/browse/SPARK-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13671: Assignee: Apache Spark (was: Davies Liu) > Use different physical plan for existing RDD and data sources > - > > Key: SPARK-13671 > URL: https://issues.apache.org/jira/browse/SPARK-13671 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > Right now, we use PhysicalRDD for both existing RDD and data sources, they > are becoming much different, we should use different physical plans for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13671) Use different physical plan for existing RDD and data sources
[ https://issues.apache.org/jira/browse/SPARK-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13671: Assignee: Davies Liu (was: Apache Spark) > Use different physical plan for existing RDD and data sources > - > > Key: SPARK-13671 > URL: https://issues.apache.org/jira/browse/SPARK-13671 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > Right now, we use PhysicalRDD for both existing RDD and data sources, they > are becoming much different, we should use different physical plans for them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13603) SQL generation for subquery
[ https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-13603. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11453 [https://github.com/apache/spark/pull/11453] > SQL generation for subquery > --- > > Key: SPARK-13603 > URL: https://issues.apache.org/jira/browse/SPARK-13603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > > > Generate SQL for subquery expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB
zhengruifeng created SPARK-13672: Summary: Add python examples of BisectingKMeans in ML and MLLIB Key: SPARK-13672 URL: https://issues.apache.org/jira/browse/SPARK-13672 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: zhengruifeng Priority: Trivial add the missing python examples of BisectingKMeans for ml and mllib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
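For reference, a rough sketch of what such an example would exercise, based on the existing Scala mllib API and assuming a SparkContext {{sc}} as in the shell; the exact Python wrapper calls may differ.

{code:title=Sketch (Scala, mllib)|borderStyle=solid}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Tiny toy dataset: two well-separated groups of points.
val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
  Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3)))

// Bisecting k-means splits clusters top-down until k clusters remain.
val model = new BisectingKMeans().setK(2).run(data)
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster $idx center: $center")
}
{code}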
[jira] [Assigned] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13672: Assignee: (was: Apache Spark) > Add python examples of BisectingKMeans in ML and MLLIB > -- > > Key: SPARK-13672 > URL: https://issues.apache.org/jira/browse/SPARK-13672 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: zhengruifeng >Priority: Trivial > > add the missing python examples of BisectingKMeans for ml and mllib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13672: Assignee: Apache Spark > Add python examples of BisectingKMeans in ML and MLLIB > -- > > Key: SPARK-13672 > URL: https://issues.apache.org/jira/browse/SPARK-13672 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Trivial > > add the missing python examples of BisectingKMeans for ml and mllib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179555#comment-15179555 ] Apache Spark commented on SPARK-13672: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/11515 > Add python examples of BisectingKMeans in ML and MLLIB > -- > > Key: SPARK-13672 > URL: https://issues.apache.org/jira/browse/SPARK-13672 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: zhengruifeng >Priority: Trivial > > add the missing python examples of BisectingKMeans for ml and mllib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.
Masayoshi TSUZUKI created SPARK-13673: - Summary: script bin\beeline.cmd pollutes environment variables in Windows. Key: SPARK-13673 URL: https://issues.apache.org/jira/browse/SPARK-13673 Project: Spark Issue Type: Improvement Components: Windows Affects Versions: 1.6.0 Environment: Windows 8.1 Reporter: Masayoshi TSUZUKI Priority: Minor {{bin\beeline.cmd}} pollutes environment variables in Windows. A similar problem was reported and fixed in [SPARK-3943], but {{bin\beeline.cmd}} was added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer
[ https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179571#comment-15179571 ] Nick Pentreath commented on SPARK-13629: Only the word count would be set to 1 (for non-zero count). > Add binary toggle Param to CountVectorizer > -- > > Key: SPARK-13629 > URL: https://issues.apache.org/jira/browse/SPARK-13629 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > It would be handy to add a binary toggle Param to CountVectorizer, as in the > scikit-learn one: > [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html] > If set, then all non-zero counts will be set to 1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
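A small illustration of the proposed behaviour (not the actual CountVectorizer code), assuming mllib vectors: with the toggle on, every non-zero term count is replaced by 1.0.

{code:title=Sketch (Scala)|borderStyle=solid}
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Replace every non-zero count with 1.0, preserving the sparsity pattern.
def binarize(counts: Vector): Vector = counts match {
  case sv: SparseVector =>
    Vectors.sparse(sv.size, sv.indices, Array.fill(sv.indices.length)(1.0))
  case v =>
    Vectors.dense(v.toArray.map(c => if (c != 0.0) 1.0 else 0.0))
}

// binarize(Vectors.sparse(5, Array(1, 3), Array(2.0, 7.0)))
// == Vectors.sparse(5, Array(1, 3), Array(1.0, 1.0))
{code}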
[jira] [Assigned] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.
[ https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13673: Assignee: (was: Apache Spark) > script bin\beeline.cmd pollutes environment variables in Windows. > - > > Key: SPARK-13673 > URL: https://issues.apache.org/jira/browse/SPARK-13673 > Project: Spark > Issue Type: Improvement > Components: Windows >Affects Versions: 1.6.0 > Environment: Windows 8.1 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > {{bin\beeline.cmd}} pollutes environment variables in Windows. > The similar problem is reported and fixed in [SPARK-3943], but > {{bin\beeline.cmd}} is added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.
[ https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179578#comment-15179578 ] Apache Spark commented on SPARK-13673: -- User 'tsudukim' has created a pull request for this issue: https://github.com/apache/spark/pull/11516 > script bin\beeline.cmd pollutes environment variables in Windows. > - > > Key: SPARK-13673 > URL: https://issues.apache.org/jira/browse/SPARK-13673 > Project: Spark > Issue Type: Improvement > Components: Windows >Affects Versions: 1.6.0 > Environment: Windows 8.1 >Reporter: Masayoshi TSUZUKI >Priority: Minor > > {{bin\beeline.cmd}} pollutes environment variables in Windows. > The similar problem is reported and fixed in [SPARK-3943], but > {{bin\beeline.cmd}} is added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.
[ https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13673: Assignee: Apache Spark > script bin\beeline.cmd pollutes environment variables in Windows. > - > > Key: SPARK-13673 > URL: https://issues.apache.org/jira/browse/SPARK-13673 > Project: Spark > Issue Type: Improvement > Components: Windows >Affects Versions: 1.6.0 > Environment: Windows 8.1 >Reporter: Masayoshi TSUZUKI >Assignee: Apache Spark >Priority: Minor > > {{bin\beeline.cmd}} pollutes environment variables in Windows. > The similar problem is reported and fixed in [SPARK-3943], but > {{bin\beeline.cmd}} is added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-13652) TransportClient.sendRpcSync returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangyu closed SPARK-13652. --- This issue has been fixed by Shixiong Zhu > TransportClient.sendRpcSync returns wrong results > - > > Key: SPARK-13652 > URL: https://issues.apache.org/jira/browse/SPARK-13652 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: huangyu >Assignee: Shixiong Zhu > Fix For: 1.6.2, 2.0.0 > > Attachments: RankHandler.java, Test.java > > > TransportClient is not thread-safe, and if it is called from multiple threads, > the messages can't be encoded and decoded correctly. Below is my code, and it > will print wrong messages. > {code} > public static void main(String[] args) throws IOException, > InterruptedException { > TransportServer server = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap<String, String>())), new > RankHandler()). > createServer(8081, new > LinkedList<TransportServerBootstrap>()); > TransportContext context = new TransportContext(new > TransportConf("test", > new MapConfigProvider(new HashMap<String, String>())), new > NoOpRpcHandler(), true); > final TransportClientFactory clientFactory = > context.createClientFactory(); > List<Thread> ts = new ArrayList<>(); > for (int i = 0; i < 10; i++) { > ts.add(new Thread(new Runnable() { > @Override > public void run() { > for (int j = 0; j < 1000; j++) { > try { > ByteBuf buf = Unpooled.buffer(8); > buf.writeLong((long) j); > ByteBuffer byteBuffer = > clientFactory.createClient("localhost", 8081). > sendRpcSync(buf.nioBuffer(), > Long.MAX_VALUE); > long response = byteBuffer.getLong(); > if (response != j) { > System.err.println("send:" + j + ",response:" > + response); > } > } catch (IOException e) { > e.printStackTrace(); > } > } > } > })); > ts.get(i).start(); > } > for (Thread t : ts) { > t.join(); > } > server.close(); > } > public class RankHandler extends RpcHandler { > private final Logger logger = LoggerFactory.getLogger(RankHandler.class); > private final StreamManager streamManager; > public RankHandler() { > this.streamManager = new OneForOneStreamManager(); > } > @Override > public void receive(TransportClient client, ByteBuffer msg, > RpcResponseCallback callback) { > callback.onSuccess(msg); > } > @Override > public StreamManager getStreamManager() { > return streamManager; > } > } > {code} > it will print output like the following: > send:221,response:222 > send:233,response:234 > send:312,response:313 > send:358,response:359 > ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1
[ https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179593#comment-15179593 ] Sean Owen commented on SPARK-13663: --- OK to update for master/1.6 > Upgrade Snappy Java to 1.1.2.1 > -- > > Key: SPARK-13663 > URL: https://issues.apache.org/jira/browse/SPARK-13663 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > The JVM memory leak problem reported in > https://github.com/xerial/snappy-java/issues/131 has been resolved. > 1.1.2.1 was released on Jan 22nd. > We should upgrade to this release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13674) Add wholestage codegen support to Sample
Liang-Chi Hsieh created SPARK-13674: --- Summary: Add wholestage codegen support to Sample Key: SPARK-13674 URL: https://issues.apache.org/jira/browse/SPARK-13674 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Sample operator doesn't support wholestage codegen now. This issue is opened to add support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13674) Add wholestage codegen support to Sample
[ https://issues.apache.org/jira/browse/SPARK-13674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13674: Assignee: (was: Apache Spark) > Add wholestage codegen support to Sample > > > Key: SPARK-13674 > URL: https://issues.apache.org/jira/browse/SPARK-13674 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Sample operator doesn't support wholestage codegen now. This issue is opened > to add support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13674) Add wholestage codegen support to Sample
[ https://issues.apache.org/jira/browse/SPARK-13674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13674: Assignee: Apache Spark > Add wholestage codegen support to Sample > > > Key: SPARK-13674 > URL: https://issues.apache.org/jira/browse/SPARK-13674 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > Sample operator doesn't support wholestage codegen now. This issue is opened > to add support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13674) Add wholestage codegen support to Sample
[ https://issues.apache.org/jira/browse/SPARK-13674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179639#comment-15179639 ] Apache Spark commented on SPARK-13674: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11517 > Add wholestage codegen support to Sample > > > Key: SPARK-13674 > URL: https://issues.apache.org/jira/browse/SPARK-13674 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > > Sample operator doesn't support wholestage codegen now. This issue is opened > to add support for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13646) QuantileDiscretizer counts dataset twice in getSampledInput
[ https://issues.apache.org/jira/browse/SPARK-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13646. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11491 [https://github.com/apache/spark/pull/11491] > QuantileDiscretizer counts dataset twice in getSampledInput > --- > > Key: SPARK-13646 > URL: https://issues.apache.org/jira/browse/SPARK-13646 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Abou Haydar Elias >Priority: Trivial > Labels: patch, performance > Fix For: 2.0.0 > > > getSampledInput counts the dataset twice as you see here : > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L116] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
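A minimal sketch of the intended change, with the method signature approximated from the linked code: count the dataset once and reuse the result when computing the sample fraction.

{code:title=Sketch (Scala)|borderStyle=solid}
import org.apache.spark.rdd.RDD

def getSampledInput(dataset: RDD[Double], numBins: Int, seed: Long): Array[Double] = {
  val totalSamples = dataset.count()   // single pass instead of counting twice
  require(totalSamples > 0, "QuantileDiscretizer requires a non-empty dataset")
  val requiredSamples = math.max(numBins * numBins, 10000)
  val fraction = math.min(requiredSamples.toDouble / totalSamples, 1.0)
  dataset.sample(withReplacement = false, fraction, seed).collect()
}
{code}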
[jira] [Updated] (SPARK-13646) QuantileDiscretizer counts dataset twice in getSampledInput
[ https://issues.apache.org/jira/browse/SPARK-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13646: -- Assignee: Abou Haydar Elias > QuantileDiscretizer counts dataset twice in getSampledInput > --- > > Key: SPARK-13646 > URL: https://issues.apache.org/jira/browse/SPARK-13646 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.6.0 >Reporter: Abou Haydar Elias >Assignee: Abou Haydar Elias >Priority: Trivial > Labels: patch, performance > Fix For: 2.0.0 > > > getSampledInput counts the dataset twice as you see here : > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L116] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
Saisai Shao created SPARK-13675: --- Summary: The url link in historypage is not correct for application running in yarn cluster mode Key: SPARK-13675 URL: https://issues.apache.org/jira/browse/SPARK-13675 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Saisai Shao Current URL for each application to access history UI is like: http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/ Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it will parse to attempt id in {{HistoryServer}}, while the correct attempt id should be like "appattempt_1457058760338_0016_02", so it will fail to parse to a correct attempt id in {{HistoryServer}}. This is OK in yarn client mode, since we don't need this attempt id to fetch out the app cache, but it fails in yarn cluster mode, where attempt id "1" or "2" is actually wrong. So here we should fix this url to parse the correct application id and attempt id. This bug is newly introduced in SPARK-10873; there's no issue in branch 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
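For illustration only, a hypothetical helper (not existing Spark code) showing the mapping the page would need: combining the application id with the attempt counter from the URL to form a YARN-style attempt id. The zero-padding below assumes YARN's usual ApplicationAttemptId format.

{code:title=Sketch (Scala)|borderStyle=solid}
// "application_1457058760338_0016" + attempt 2 -> "appattempt_1457058760338_0016_000002"
def toYarnAttemptId(appId: String, attempt: Int): String = {
  val suffix = appId.stripPrefix("application_")
  f"appattempt_${suffix}_$attempt%06d"
}
{code}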
[jira] [Updated] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13675: Description: Current URL for each application to access history UI is like: http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/ Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it will parse to attempt id in {{HistoryServer}}, while the correct attempt id should be like "appattempt_1457058760338_0016_02", so it will fail to parse to a correct attempt id in {{HistoryServer}}. This is OK in yarn client mode, since we don't need this attempt id to fetch out the app cache, but it is failed in yarn cluster mode, where attempt id "1" or "2" is actually wrong. So here we should fix this url to parse the correct application id and attempt id. This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. was: Current URL for each application to access history UI is like: http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/ Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it will parse to attempt id in {{HistoryServer}}, while the correct attempt id should be like "appattempt_1457058760338_0016_02", so it will failed to parse to a correct attempt id in {{HistoryServer}}. This is OK in yarn client mode, since we don't need this attempt id to fetch out the app cache, but it is failed in yarn cluster mode, where attempt id "1" or "2" is actually wrong. So here we should fix this url to parse the correct application id and attempt id. This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. > The url link in historypage is not correct for application running in yarn > cluster mode > --- > > Key: SPARK-13675 > URL: https://issues.apache.org/jira/browse/SPARK-13675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao > > Current URL for each application to access history UI is like: > http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or > http://localhost:18080/history/application_1457058760338_0016/2/jobs/ > Here *1* or *2* represents the number of attempts in {{historypage.js}}, but > it will parse to attempt id in {{HistoryServer}}, while the correct attempt > id should be like "appattempt_1457058760338_0016_02", so it will fail to > parse to a correct attempt id in {{HistoryServer}}. > This is OK in yarn client mode, since we don't need this attempt id to fetch > out the app cache, but it is failed in yarn cluster mode, where attempt id > "1" or "2" is actually wrong. > So here we should fix this url to parse the correct application id and > attempt id. > This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13675: Description: Current URL for each application to access history UI is like: http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/ Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it will parse to attempt id in {{HistoryServer}}, while the correct attempt id should be like "appattempt_1457058760338_0016_02", so it will fail to parse to a correct attempt id in {{HistoryServer}}. This is OK in yarn client mode, since we don't need this attempt id to fetch out the app cache, but it is failed in yarn cluster mode, where attempt id "1" or "2" is actually wrong. So here we should fix this url to parse the correct application id and attempt id. This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. Here is the screenshot: !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png! was: Current URL for each application to access history UI is like: http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or http://localhost:18080/history/application_1457058760338_0016/2/jobs/ Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it will parse to attempt id in {{HistoryServer}}, while the correct attempt id should be like "appattempt_1457058760338_0016_02", so it will fail to parse to a correct attempt id in {{HistoryServer}}. This is OK in yarn client mode, since we don't need this attempt id to fetch out the app cache, but it is failed in yarn cluster mode, where attempt id "1" or "2" is actually wrong. So here we should fix this url to parse the correct application id and attempt id. This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. > The url link in historypage is not correct for application running in yarn > cluster mode > --- > > Key: SPARK-13675 > URL: https://issues.apache.org/jira/browse/SPARK-13675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao > Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png > > > Current URL for each application to access history UI is like: > http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or > http://localhost:18080/history/application_1457058760338_0016/2/jobs/ > Here *1* or *2* represents the number of attempts in {{historypage.js}}, but > it will parse to attempt id in {{HistoryServer}}, while the correct attempt > id should be like "appattempt_1457058760338_0016_02", so it will fail to > parse to a correct attempt id in {{HistoryServer}}. > This is OK in yarn client mode, since we don't need this attempt id to fetch > out the app cache, but it is failed in yarn cluster mode, where attempt id > "1" or "2" is actually wrong. > So here we should fix this url to parse the correct application id and > attempt id. > This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. > Here is the screenshot: > !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-13675: Attachment: Screen Shot 2016-02-29 at 3.57.32 PM.png > The url link in historypage is not correct for application running in yarn > cluster mode > --- > > Key: SPARK-13675 > URL: https://issues.apache.org/jira/browse/SPARK-13675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao > Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png > > > Current URL for each application to access history UI is like: > http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or > http://localhost:18080/history/application_1457058760338_0016/2/jobs/ > Here *1* or *2* represents the number of attempts in {{historypage.js}}, but > it will parse to attempt id in {{HistoryServer}}, while the correct attempt > id should be like "appattempt_1457058760338_0016_02", so it will fail to > parse to a correct attempt id in {{HistoryServer}}. > This is OK in yarn client mode, since we don't need this attempt id to fetch > out the app cache, but it is failed in yarn cluster mode, where attempt id > "1" or "2" is actually wrong. > So here we should fix this url to parse the correct application id and > attempt id. > This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13675: Assignee: (was: Apache Spark) > The url link in historypage is not correct for application running in yarn > cluster mode > --- > > Key: SPARK-13675 > URL: https://issues.apache.org/jira/browse/SPARK-13675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao > Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png > > > Current URL for each application to access history UI is like: > http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or > http://localhost:18080/history/application_1457058760338_0016/2/jobs/ > Here *1* or *2* represents the number of attempts in {{historypage.js}}, but > it will parse to attempt id in {{HistoryServer}}, while the correct attempt > id should be like "appattempt_1457058760338_0016_02", so it will fail to > parse to a correct attempt id in {{HistoryServer}}. > This is OK in yarn client mode, since we don't need this attempt id to fetch > out the app cache, but it is failed in yarn cluster mode, where attempt id > "1" or "2" is actually wrong. > So here we should fix this url to parse the correct application id and > attempt id. > This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. > Here is the screenshot: > !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13675: Assignee: Apache Spark > The url link in historypage is not correct for application running in yarn > cluster mode > --- > > Key: SPARK-13675 > URL: https://issues.apache.org/jira/browse/SPARK-13675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Assignee: Apache Spark > Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png > > > Current URL for each application to access history UI is like: > http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or > http://localhost:18080/history/application_1457058760338_0016/2/jobs/ > Here *1* or *2* represents the number of attempts in {{historypage.js}}, but > it will parse to attempt id in {{HistoryServer}}, while the correct attempt > id should be like "appattempt_1457058760338_0016_02", so it will fail to > parse to a correct attempt id in {{HistoryServer}}. > This is OK in yarn client mode, since we don't need this attempt id to fetch > out the app cache, but it is failed in yarn cluster mode, where attempt id > "1" or "2" is actually wrong. > So here we should fix this url to parse the correct application id and > attempt id. > This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. > Here is the screenshot: > !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode
[ https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179683#comment-15179683 ] Apache Spark commented on SPARK-13675: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/11518 > The url link in historypage is not correct for application running in yarn > cluster mode > --- > > Key: SPARK-13675 > URL: https://issues.apache.org/jira/browse/SPARK-13675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao > Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png > > > Current URL for each application to access history UI is like: > http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or > http://localhost:18080/history/application_1457058760338_0016/2/jobs/ > Here *1* or *2* represents the number of attempts in {{historypage.js}}, but > it will parse to attempt id in {{HistoryServer}}, while the correct attempt > id should be like "appattempt_1457058760338_0016_02", so it will fail to > parse to a correct attempt id in {{HistoryServer}}. > This is OK in yarn client mode, since we don't need this attempt id to fetch > out the app cache, but it is failed in yarn cluster mode, where attempt id > "1" or "2" is actually wrong. > So here we should fix this url to parse the correct application id and > attempt id. > This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6. > Here is the screenshot: > !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13398) Move away from deprecated ThreadPoolTaskSupport
[ https://issues.apache.org/jira/browse/SPARK-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13398. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11423 [https://github.com/apache/spark/pull/11423] > Move away from deprecated ThreadPoolTaskSupport > --- > > Key: SPARK-13398 > URL: https://issues.apache.org/jira/browse/SPARK-13398 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: holdenk >Priority: Trivial > Fix For: 2.0.0 > > > ThreadPoolTaskSupport has been replaced by ForkJoinTaskSupport -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
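For reference, a minimal sketch of the replacement pattern on a plain Scala parallel collection, assuming the Scala 2.10/2.11 layout where ForkJoinPool still lives under scala.concurrent.forkjoin.

{code:title=Sketch (Scala)|borderStyle=solid}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// Instead of the deprecated ThreadPoolTaskSupport, back the parallel
// collection with a ForkJoinTaskSupport over an explicitly sized pool.
val work = (1 to 8).par
work.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
work.foreach(i => println(s"processing chunk $i"))
{code}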
[jira] [Updated] (SPARK-13398) Move away from deprecated ThreadPoolTaskSupport
[ https://issues.apache.org/jira/browse/SPARK-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13398: -- Assignee: holdenk > Move away from deprecated ThreadPoolTaskSupport > --- > > Key: SPARK-13398 > URL: https://issues.apache.org/jira/browse/SPARK-13398 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: holdenk >Assignee: holdenk >Priority: Trivial > Fix For: 2.0.0 > > > ThreadPoolTaskSupport has been replaced by ForkJoinTaskSupport -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject
[ https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12925: -- Priority: Minor (was: Major) > Improve HiveInspectors.unwrap for > StringObjectInspector.getPrimitiveWritableObject > -- > > Key: SPARK-12925 > URL: https://issues.apache.org/jira/browse/SPARK-12925 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Rajesh Balamohan >Assignee: Rajesh Balamohan >Priority: Minor > Fix For: 2.0.0 > > Attachments: SPARK-12925_profiler_cpu_samples.png > > > Text is in UTF-8 and converting it via "UTF8String.fromString" incurs > decoding and encoding, which turns out to be expensive. (to be specific: > https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
Dongjoon Hyun created SPARK-13676: - Summary: Fix mismatched default values for regParam in LogisticRegression Key: SPARK-13676 URL: https://issues.apache.org/jira/browse/SPARK-13676 Project: Spark Issue Type: Bug Components: ML Reporter: Dongjoon Hyun The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value. {code:title=Scala|borderStyle=solid} scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam res0: Double = 0.0 {code} {code:title=Python|borderStyle=solid} >>> from pyspark.ml.classification import LogisticRegression >>> LogisticRegression().getRegParam() 0.1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-13676: -- Component/s: MLlib > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179792#comment-15179792 ] Apache Spark commented on SPARK-13676: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/11519 > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13676: Assignee: (was: Apache Spark) > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13676: Assignee: Apache Spark > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179796#comment-15179796 ] Sean Owen commented on SPARK-13596: --- [~nchammas] do you happen to know how we can configure stuff to expect {{tox.ini}} in the {{python}} directory instead? I'm trying to clean up the top level. > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13677) Support Tree-Based Feature Transformation for mllib
zhengruifeng created SPARK-13677: Summary: Support Tree-Based Feature Transformation for mllib Key: SPARK-13677 URL: https://issues.apache.org/jira/browse/SPARK-13677 Project: Spark Issue Type: New Feature Reporter: zhengruifeng Priority: Minor It would be nice to be able to use RF and GBT for feature transformation: First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. This method was first introduced by Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is implemented in two well-known libraries: sklearn (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) xgboost (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) I have implemented it in mllib: val features : RDD[Vector] = ... val model1 : RandomForestModel = ... val transformed1 : RDD[Vector] = model1.leaf(features) val model2 : GradientBoostedTreesModel = ... val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
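The one-hot step can be illustrated independently of any tree API. A minimal sketch, assuming we already have the leaf index reached in each tree and the number of leaves per tree (which the proposed {{leaf}} method above would supply):

{code:title=Sketch (Scala)|borderStyle=solid}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Each tree contributes one block of the new feature space, one slot per leaf.
// leafIndices(i) is the leaf reached in tree i; leavesPerTree(i) is tree i's leaf count.
def oneHotLeaves(leafIndices: Array[Int], leavesPerTree: Array[Int]): Vector = {
  val offsets = leavesPerTree.scanLeft(0)(_ + _)   // start offset of each tree's block
  val active = leafIndices.zipWithIndex.map { case (leaf, tree) => offsets(tree) + leaf }
  Vectors.sparse(offsets.last, active, Array.fill(active.length)(1.0))
}

// Two trees with 3 and 4 leaves; a sample lands in leaf 2 of tree 0 and leaf 1 of tree 1:
// oneHotLeaves(Array(2, 1), Array(3, 4)) == Vectors.sparse(7, Array(2, 4), Array(1.0, 1.0))
{code}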
[jira] [Assigned] (SPARK-13677) Support Tree-Based Feature Transformation for mllib
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13677: Assignee: Apache Spark > Support Tree-Based Feature Transformation for mllib > --- > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two famous library: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implement it in mllib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for mllib
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179827#comment-15179827 ] Apache Spark commented on SPARK-13677: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/11520 > Support Tree-Based Feature Transformation for mllib > --- > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two famous library: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implement it in mllib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13677) Support Tree-Based Feature Transformation for mllib
[ https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13677: Assignee: (was: Apache Spark) > Support Tree-Based Feature Transformation for mllib > --- > > Key: SPARK-13677 > URL: https://issues.apache.org/jira/browse/SPARK-13677 > Project: Spark > Issue Type: New Feature >Reporter: zhengruifeng >Priority: Minor > > It would be nice to be able to use RF and GBT for feature transformation: > First fit an ensemble of trees (like RF, GBT or other TreeEnsambleModels) on > the training set. Then each leaf of each tree in the ensemble is assigned a > fixed arbitrary feature index in a new feature space. These leaf indices are > then encoded in a one-hot fashion. > This method was first introduced by > facebook(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is > implemented in two famous library: > sklearn > (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py) > xgboost > (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py) > I have implement it in mllib: > val features : RDD[Vector] = ... > val model1 : RandomForestModel = ... > val transformed1 : RDD[Vector] = model1.leaf(features) > val model2 : GradientBoostedTreesModel = ... > val transformed2 : RDD[Vector] = model2.leaf(features) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13489) GSoC 2016 project ideas for MLlib
[ https://issues.apache.org/jira/browse/SPARK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179833#comment-15179833 ] Kai Jiang commented on SPARK-13489: --- [~josephkb] Thanks for your explanation! It seems like there are lots of missing models in SparkR. I opened a google docs ([link|https://docs.google.com/document/d/15h1IbuGJMQvqCU7kALZ4Qr6tZPPqI2hgXTYnJPIiFXg/edit?usp=sharing]) and put some ideas into it. Do you mind giving some suggestions about whether those ideas are suitable for GSoC project? cc [~mengxr] [~mlnick] > GSoC 2016 project ideas for MLlib > - > > Key: SPARK-13489 > URL: https://issues.apache.org/jira/browse/SPARK-13489 > Project: Spark > Issue Type: Brainstorming > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > > I want to use this JIRA to collect some GSoC project ideas for MLlib. > Ideally, the student should have contributed to Spark. And the content of the > project could be divided into small functional pieces so that it won't get > stalled if the mentor is temporarily unavailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13678) transformExpressions should exclude expression that is not inside QueryPlan.expressions
Wenchen Fan created SPARK-13678: --- Summary: transformExpressions should exclude expression that is not inside QueryPlan.expressions Key: SPARK-13678 URL: https://issues.apache.org/jira/browse/SPARK-13678 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions
[ https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-13678: Summary: transformExpressions should only apply on QueryPlan.expressions (was: transformExpressions should exclude expression that is not inside QueryPlan.expressions) > transformExpressions should only apply on QueryPlan.expressions > --- > > Key: SPARK-13678 > URL: https://issues.apache.org/jira/browse/SPARK-13678 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions
[ https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13678: Assignee: Apache Spark > transformExpressions should only apply on QueryPlan.expressions > --- > > Key: SPARK-13678 > URL: https://issues.apache.org/jira/browse/SPARK-13678 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions
[ https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179871#comment-15179871 ] Apache Spark commented on SPARK-13678: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/11521 > transformExpressions should only apply on QueryPlan.expressions > --- > > Key: SPARK-13678 > URL: https://issues.apache.org/jira/browse/SPARK-13678 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions
[ https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13678: Assignee: (was: Apache Spark) > transformExpressions should only apply on QueryPlan.expressions > --- > > Key: SPARK-13678 > URL: https://issues.apache.org/jira/browse/SPARK-13678 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13596: Assignee: Apache Spark > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Apache Spark > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179894#comment-15179894 ] Apache Spark commented on SPARK-13596: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/11522 > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13596: Assignee: (was: Apache Spark) > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.
[ https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13673: -- Assignee: Masayoshi TSUZUKI > script bin\beeline.cmd pollutes environment variables in Windows. > - > > Key: SPARK-13673 > URL: https://issues.apache.org/jira/browse/SPARK-13673 > Project: Spark > Issue Type: Improvement > Components: Windows >Affects Versions: 1.6.0 > Environment: Windows 8.1 >Reporter: Masayoshi TSUZUKI >Assignee: Masayoshi TSUZUKI >Priority: Minor > Fix For: 2.0.0 > > > {{bin\beeline.cmd}} pollutes environment variables in Windows. > The similar problem is reported and fixed in [SPARK-3943], but > {{bin\beeline.cmd}} is added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.
[ https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13673. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11516 [https://github.com/apache/spark/pull/11516] > script bin\beeline.cmd pollutes environment variables in Windows. > - > > Key: SPARK-13673 > URL: https://issues.apache.org/jira/browse/SPARK-13673 > Project: Spark > Issue Type: Improvement > Components: Windows >Affects Versions: 1.6.0 > Environment: Windows 8.1 >Reporter: Masayoshi TSUZUKI >Priority: Minor > Fix For: 2.0.0 > > > {{bin\beeline.cmd}} pollutes environment variables in Windows. > The similar problem is reported and fixed in [SPARK-3943], but > {{bin\beeline.cmd}} is added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11515) QuantileDiscretizer should take random seed
[ https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11515: -- Fix Version/s: 1.6.2 > QuantileDiscretizer should take random seed > --- > > Key: SPARK-11515 > URL: https://issues.apache.org/jira/browse/SPARK-11515 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yu Ishikawa >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > QuantileDiscretizer takes a random sample to select bins. It currently does > not specify a seed for the XORShiftRandom, but it should take a seed by > extending the HasSeed Param. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
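To illustrate what the new param buys: with the {{seed}} Param added by this change (available from the 1.6.2/2.0.0 fix versions above), the random sample used to choose bin boundaries can be pinned, so repeated fits produce the same splits. A hedged Scala sketch; the DataFrame {{df}} and the column names are assumptions:

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

// `df` is assumed to be an existing DataFrame with a numeric column "hour".
val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(3)
  .setSeed(42L)  // pin the random sample used to pick bin boundaries

val bucketizer = discretizer.fit(df)   // returns a Bucketizer with fixed splits
val bucketed = bucketizer.transform(df)
{code}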
[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180011#comment-15180011 ] Daniel Jouany commented on SPARK-10795: --- Hi there, If i follow your suggestions, it works. Our code was like that : {{ Import numpy as np Import SparkContext foo = np.genfromtext(x) sc=SparkContext(...) #compute }} *===> It fails* We have just moved the global variable initialization *after* the context init: {{ Import numpy as np Import SparkContext global foo sc=SparkContext(...) foo = np.genfromtext(x) #compute }} *===> It works perfectly* Note that you could reproduce this behaviour with something else than a numpy call - eventhough not every statement does entail the crash. The question is : why is this *non-spark* variable init interfering with the SparkContext > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180011#comment-15180011 ] Daniel Jouany edited comment on SPARK-10795 at 3/4/16 3:30 PM: --- Hi there, If i follow your suggestions, it works. Our code was like that : {code} Import numpy as np Import SparkContext foo = np.genfromtext(x) sc=SparkContext(...) #compute {code} *===> It fails* We have just moved the global variable initialization *after* the context init: {code} Import numpy as np Import SparkContext global foo sc=SparkContext(...) foo = np.genfromtext(x) #compute {code} *===> It works perfectly* Note that you could reproduce this behaviour with something else than a numpy call - eventhough not every statement does entail the crash. The question is : why is this *non-spark* variable init interfering with the SparkContext was (Author: djouany): Hi there, If i follow your suggestions, it works. Our code was like that : {{ Import numpy as np Import SparkContext foo = np.genfromtext(x) sc=SparkContext(...) #compute }} *===> It fails* We have just moved the global variable initialization *after* the context init: {{ Import numpy as np Import SparkContext global foo sc=SparkContext(...) foo = np.genfromtext(x) #compute }} *===> It works perfectly* Note that you could reproduce this behaviour with something else than a numpy call - eventhough not every statement does entail the crash. The question is : why is this *non-spark* variable init interfering with the SparkContext > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3374) Spark on Yarn remove deprecated configs for 2.0
[ https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180027#comment-15180027 ] Thomas Graves commented on SPARK-3374: -- [~srowen] can you add [~jerrypeng] as a contributor so he can assign himself to jira? > Spark on Yarn remove deprecated configs for 2.0 > --- > > Key: SPARK-3374 > URL: https://issues.apache.org/jira/browse/SPARK-3374 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.1.0 >Reporter: Thomas Graves > > The configs in yarn have gotten scattered and inconsistent between cluster > and client modes and supporting backwards compatibility. We should try to > clean this up, move things to common places and support configs across both > cluster and client modes where we want to make them public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3374) Spark on Yarn remove deprecated configs for 2.0
[ https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180031#comment-15180031 ] Sean Owen commented on SPARK-3374: -- Yes and I'll make you an admin so you can assign. > Spark on Yarn remove deprecated configs for 2.0 > --- > > Key: SPARK-3374 > URL: https://issues.apache.org/jira/browse/SPARK-3374 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 1.1.0 >Reporter: Thomas Graves > > The configs in yarn have gotten scattered and inconsistent between cluster > and client modes and supporting backwards compatibility. We should try to > clean this up, move things to common places and support configs across both > cluster and client modes where we want to make them public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13595) Move docker, extras modules into external
[ https://issues.apache.org/jira/browse/SPARK-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180037#comment-15180037 ] Apache Spark commented on SPARK-13595: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/11523 > Move docker, extras modules into external > - > > Key: SPARK-13595 > URL: https://issues.apache.org/jira/browse/SPARK-13595 > Project: Spark > Issue Type: Sub-task > Components: Build, Examples >Affects Versions: 2.0.0 >Reporter: Sean Owen > > See also SPARK-13529, SPARK-13548. In the same spirit [~rxin] I'd like to put > the {{docker}} and {{docker-integration-test}} modules, and everything under > {{extras}}, under {{external}}. This groups these pretty logically related > modules and removes three top-level dirs. > I'll take a look at it and see if there are any complications that this would > entail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13595) Move docker, extras modules into external
[ https://issues.apache.org/jira/browse/SPARK-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13595: Assignee: (was: Apache Spark) > Move docker, extras modules into external > - > > Key: SPARK-13595 > URL: https://issues.apache.org/jira/browse/SPARK-13595 > Project: Spark > Issue Type: Sub-task > Components: Build, Examples >Affects Versions: 2.0.0 >Reporter: Sean Owen > > See also SPARK-13529, SPARK-13548. In the same spirit [~rxin] I'd like to put > the {{docker}} and {{docker-integration-test}} modules, and everything under > {{extras}}, under {{external}}. This groups these pretty logically related > modules and removes three top-level dirs. > I'll take a look at it and see if there are any complications that this would > entail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13595) Move docker, extras modules into external
[ https://issues.apache.org/jira/browse/SPARK-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13595: Assignee: Apache Spark > Move docker, extras modules into external > - > > Key: SPARK-13595 > URL: https://issues.apache.org/jira/browse/SPARK-13595 > Project: Spark > Issue Type: Sub-task > Components: Build, Examples >Affects Versions: 2.0.0 >Reporter: Sean Owen >Assignee: Apache Spark > > See also SPARK-13529, SPARK-13548. In the same spirit [~rxin] I'd like to put > the {{docker}} and {{docker-integration-test}} modules, and everything under > {{extras}}, under {{external}}. This groups these pretty logically related > modules and removes three top-level dirs. > I'll take a look at it and see if there are any complications that this would > entail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1
[ https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180047#comment-15180047 ] Apache Spark commented on SPARK-13663: -- User 'yy2016' has created a pull request for this issue: https://github.com/apache/spark/pull/11524 > Upgrade Snappy Java to 1.1.2.1 > -- > > Key: SPARK-13663 > URL: https://issues.apache.org/jira/browse/SPARK-13663 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > The JVM memory leaky problem reported in > https://github.com/xerial/snappy-java/issues/131 has been resolved. > 1.1.2.1 was released on Jan 22nd. > We should upgrade to this release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1
[ https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13663: Assignee: Apache Spark > Upgrade Snappy Java to 1.1.2.1 > -- > > Key: SPARK-13663 > URL: https://issues.apache.org/jira/browse/SPARK-13663 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Minor > > The JVM memory leaky problem reported in > https://github.com/xerial/snappy-java/issues/131 has been resolved. > 1.1.2.1 was released on Jan 22nd. > We should upgrade to this release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1
[ https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13663: Assignee: (was: Apache Spark) > Upgrade Snappy Java to 1.1.2.1 > -- > > Key: SPARK-13663 > URL: https://issues.apache.org/jira/browse/SPARK-13663 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Ted Yu >Priority: Minor > > The JVM memory leaky problem reported in > https://github.com/xerial/snappy-java/issues/131 has been resolved. > 1.1.2.1 was released on Jan 22nd. > We should upgrade to this release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
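Until the upgrade lands in a Spark release, applications that hit the leak and build with sbt can, as a stopgap, force the patched artifact onto their own classpath. A hedged sketch only; whether this helps depends on how Spark is deployed, since it does not change the jars shipped inside a Spark distribution:

{code}
// build.sbt -- force the fixed snappy-java (coordinates: org.xerial.snappy:snappy-java)
dependencyOverrides += "org.xerial.snappy" % "snappy-java" % "1.1.2.1"
{code}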
[jira] [Created] (SPARK-13679) Pyspark job fails with Oozie
Alexandre Linte created SPARK-13679: --- Summary: Pyspark job fails with Oozie Key: SPARK-13679 URL: https://issues.apache.org/jira/browse/SPARK-13679 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit, YARN Affects Versions: 1.6.0 Environment: Hadoop 2.7.2, Spark 1.6.0 on Yarn, Oozie 4.2.0 Cluster secured with Kerberos Reporter: Alexandre Linte Hello, I'm trying to run pi.py example in a pyspark job with Oozie. Every try I made failed for the same reason: key not found: SPARK_HOME. Note: A scala job works well in the environment with Oozie. The logs on the executors are: {noformat} SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/mnt/hd4/hadoop/yarn/local/filecache/145/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/mnt/hd2/hadoop/yarn/local/filecache/155/spark-assembly-1.6.0-hadoop2.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/opt/application/Hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /mnt/hd7/hadoop/yarn/log/application_1454673025841_13136/container_1454673025841_13136_01_01 (Is a directory) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.(FileOutputStream.java:221) at java.io.FileOutputStream.(FileOutputStream.java:142) at org.apache.log4j.FileAppender.setFile(FileAppender.java:294) at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165) at org.apache.hadoop.yarn.ContainerLogAppender.activateOptions(ContainerLogAppender.java:55) at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307) at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172) at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104) at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:809) at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:735) at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:547) at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:483) at org.apache.log4j.LogManager.(LogManager.java:127) at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64) at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:285) at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155) at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132) at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:275) at org.apache.hadoop.service.AbstractService.(AbstractService.java:43) Using properties file: null Parsed arguments: master yarn-master deployMode cluster executorMemory null executorCores null totalExecutorCores null propertiesFile null driverMemorynull driverCores null driverExtraClassPathnull driverExtraLibraryPath null driverExtraJavaOptions null supervise false queue null numExecutorsnull files null pyFiles null archivesnull mainClass null primaryResource 
hdfs://hadoopsandbox/User/toto/WORK/Oozie/pyspark/lib/pi.py namePysparkpi example childArgs [100] jarsnull packagesnull packagesExclusions null repositoriesnull verbose true Spark properties used, including those specified through --conf and those from the properties file null: spark.executorEnv.SPARK_HOME -> /opt/application/Spark/current spark.executorEnv.PYTHONPATH -> /opt/application/Spark/current/python spark.yarn.appMasterEnv.SPARK_HOME -> /opt/application/Spark/current Main class: org.apache.spark.deploy.yarn.Client Arguments: --name Pysparkpi example --primary-py-file hdfs://hadoopsandbox/User/toto/WORK/Oozie/pyspark/lib/pi.py --class org.apache.spark.deploy.PythonRunner --arg 100 System properties: spark.executorEnv.SPARK_HOME -> /opt/application/Spark/curre
[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs
[ https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180072#comment-15180072 ] Nicholas Chammas commented on SPARK-13596: -- Looks like {{tox.ini}} is only used by {{pep8}}, so if you move it into {{dev/}}, where the Python lint checks run from, that should work. > Move misc top-level build files into appropriate subdirs > > > Key: SPARK-13596 > URL: https://issues.apache.org/jira/browse/SPARK-13596 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.0.0 >Reporter: Sean Owen > > I'd like to file away a bunch of misc files that are in the top level of the > project in order to further tidy the build for 2.0.0. See also SPARK-13529, > SPARK-13548. > Some of these may turn out to be difficult or impossible to move. > I'd ideally like to move these files into {{build/}}: > - {{.rat-excludes}} > - {{checkstyle.xml}} > - {{checkstyle-suppressions.xml}} > - {{pylintrc}} > - {{scalastyle-config.xml}} > - {{tox.ini}} > - {{project/}} (or does SBT need this in the root?) > And ideally, these would go under {{dev/}} > - {{make-distribution.sh}} > And remove these > - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?) > Edited to add: apparently this can go in {{.github}} now: > - {{CONTRIBUTING.md}} > Other files in the top level seem to need to be there, like {{README.md}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
Yael Aharon created SPARK-13680: --- Summary: Java UDAF with more than one intermediate argument returns wrong results Key: SPARK-13680 URL: https://issues.apache.org/jira/browse/SPARK-13680 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Environment: CDH 5.5.2 Reporter: Yael Aharon I am trying to incorporate the Java UDAF from https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java into an SQL query. I registered the UDAF like this: sqlContext.udf().register("myavg", new MyDoubleAvg()); My SQL query is: SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, count(nulli) AS `count_nulli` FROM mytable As soon as I add the UDAF myavg to the SQL, all the results become incorrect. When I remove the call to the UDAF, the results are correct. I was able to go around the issue by modifying bufferSchema of the UDAF to use an array and the corresponding update and merge methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
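For reference, "more than one intermediate argument" here means a {{bufferSchema}} with several columns, as in {{MyDoubleAvg}}. Below is a minimal Scala sketch of an average with a two-column (sum, count) buffer, written against the public {{UserDefinedAggregateFunction}} API; the class name is made up and the aggregate is deliberately simpler than {{MyDoubleAvg}}:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Average with a two-column intermediate buffer (sum, count) -- the buffer shape
// that triggers the reported problem.
class TwoFieldAvg extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  override def bufferSchema: StructType = StructType(
    StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  override def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
}
{code}

Registered the same way as in the report, e.g. {{sqlContext.udf.register("myavg2", new TwoFieldAvg)}}.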
[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yael Aharon updated SPARK-13680: Attachment: data.csv > Java UDAF with more than one intermediate argument returns wrong results > > > Key: SPARK-13680 > URL: https://issues.apache.org/jira/browse/SPARK-13680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.2 >Reporter: Yael Aharon > Attachments: data.csv > > > I am trying to incorporate the Java UDAF from > https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java > into an SQL query. > I registered the UDAF like this: > sqlContext.udf().register("myavg", new MyDoubleAvg()); > My SQL query is: > SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, > AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS > `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS > `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS > `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS > `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS > `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS > `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS > `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, > AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, > SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, > MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, > count(nulli) AS `count_nulli` FROM mytable > As soon as I add the UDAF myavg to the SQL, all the results become incorrect. > When I remove the call to the UDAF, the results are correct. > I was able to go around the issue by modifying bufferSchema of the UDAF to > use an array and the corresponding update and merge methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13676. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11519 [https://github.com/apache/spark/pull/11519] > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun > Fix For: 2.0.0 > > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180096#comment-15180096 ] Yael Aharon commented on SPARK-13680: - I attached data.csv which is the data used for this test > Java UDAF with more than one intermediate argument returns wrong results > > > Key: SPARK-13680 > URL: https://issues.apache.org/jira/browse/SPARK-13680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.2 >Reporter: Yael Aharon > Attachments: data.csv > > > I am trying to incorporate the Java UDAF from > https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java > into an SQL query. > I registered the UDAF like this: > sqlContext.udf().register("myavg", new MyDoubleAvg()); > My SQL query is: > SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, > AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS > `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS > `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS > `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS > `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS > `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS > `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS > `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, > AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, > SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, > MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, > count(nulli) AS `count_nulli` FROM mytable > As soon as I add the UDAF myavg to the SQL, all the results become incorrect. > When I remove the call to the UDAF, the results are correct. > I was able to go around the issue by modifying bufferSchema of the UDAF to > use an array and the corresponding update and merge methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13676: -- Target Version/s: 2.0.0 > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13676: -- Assignee: Dongjoon Hyun > Fix mismatched default values for regParam in LogisticRegression > > > Key: SPARK-13676 > URL: https://issues.apache.org/jira/browse/SPARK-13676 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > The default value of regularization parameter for `LogisticRegression` > algorithm is different in Scala and Python. We should provide the same value. > {code:title=Scala|borderStyle=solid} > scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam > res0: Double = 0.0 > {code} > {code:title=Python|borderStyle=solid} > >>> from pyspark.ml.classification import LogisticRegression > >>> LogisticRegression().getRegParam() > 0.1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13036) PySpark ml.feature support export/import
[ https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13036. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11203 [https://github.com/apache/spark/pull/11203] > PySpark ml.feature support export/import > > > Key: SPARK-13036 > URL: https://issues.apache.org/jira/browse/SPARK-13036 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Add export/import for all estimators and transformers(which have Scala > implementation) under pyspark/ml/feature.py. Please refer the implementation > at SPARK-13032. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13318: -- Target Version/s: 2.0.0 > Model export/import for spark.ml: ElementwiseProduct > > > Key: SPARK-13318 > URL: https://issues.apache.org/jira/browse/SPARK-13318 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Assignee: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Add save/load to ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13319) Pyspark VectorSlicer, StopWordsRemover should have setDefault
[ https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13319. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11203 [https://github.com/apache/spark/pull/11203] > Pyspark VectorSlicer, StopWordsRemover should have setDefault > - > > Key: SPARK-13319 > URL: https://issues.apache.org/jira/browse/SPARK-13319 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Pyspark VectorSlicer should have setDefault, otherwise it will cause an error > when calling getNames or getIndices. > StopWordsRemover needs to set a default value for "caseSensitive". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13319) Pyspark VectorSlicer, StopWordsRemover should have setDefault
[ https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13319: -- Target Version/s: 2.0.0 > Pyspark VectorSlicer, StopWordsRemover should have setDefault > - > > Key: SPARK-13319 > URL: https://issues.apache.org/jira/browse/SPARK-13319 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Assignee: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Pyspark VectorSlicer should have setDefault, otherwise it will cause an error > when calling getNames or getIndices. > StopWordsRemover needs to set a default value for "caseSensitive". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-13318. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11203 [https://github.com/apache/spark/pull/11203] > Model export/import for spark.ml: ElementwiseProduct > > > Key: SPARK-13318 > URL: https://issues.apache.org/jira/browse/SPARK-13318 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Add save/load to ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13319) Pyspark VectorSlicer, StopWordsRemover should have setDefault
[ https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13319: -- Assignee: Xusen Yin > Pyspark VectorSlicer, StopWordsRemover should have setDefault > - > > Key: SPARK-13319 > URL: https://issues.apache.org/jira/browse/SPARK-13319 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Xusen Yin >Assignee: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Pyspark VectorSlicer should have setDefault, otherwise it will cause an error > when calling getNames or getIndices. > StopWordsRemover needs to set a default value for "caseSensitive". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct
[ https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-13318: -- Assignee: Xusen Yin > Model export/import for spark.ml: ElementwiseProduct > > > Key: SPARK-13318 > URL: https://issues.apache.org/jira/browse/SPARK-13318 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Assignee: Xusen Yin >Priority: Minor > Fix For: 2.0.0 > > > Add save/load to ElementwiseProduct -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13494) Cannot sort on a column which is of type "array"
[ https://issues.apache.org/jira/browse/SPARK-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172115#comment-15172115 ] Yael Aharon edited comment on SPARK-13494 at 3/4/16 4:35 PM: - I am using Spark 1.5 from Cloudera distribution CDH 5.5.2 . Do you think this was fixed since? The Hive schema of the column in question is array was (Author: yael): I am using Spark 5.2 from Cloudera distribution CDH 5.2 . Do you think this was fixed since? The Hive schema of the column in question is array > Cannot sort on a column which is of type "array" > > > Key: SPARK-13494 > URL: https://issues.apache.org/jira/browse/SPARK-13494 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yael Aharon > > Executing the following SQL results in an error if columnName refers to a > column of type array > SELECT * FROM myTable ORDER BY columnName ASC LIMIT 50 > The error is > org.apache.spark.sql.AnalysisException: cannot resolve 'columnName ASC' due > to data type mismatch: cannot sort data type array -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
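In the Spark version reported here, arrays are not an orderable type in the analyzer, which is why the ORDER BY above is rejected. If the goal is just a deterministic ordering, one hedged workaround is to sort on a scalar derived from the array; a Scala sketch, using the placeholder table and column names from the report:

{code}
import org.apache.spark.sql.functions.{col, size}

// Ordering by the array column itself raises the AnalysisException above;
// ordering by a scalar derived from it is accepted.
val byFirstElement = sqlContext.table("myTable").orderBy(col("columnName").getItem(0)).limit(50)
val byLength       = sqlContext.table("myTable").orderBy(size(col("columnName"))).limit(50)
{code}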
[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yael Aharon updated SPARK-13680: Description: I am trying to incorporate the Java UDAF from https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java into an SQL query. I registered the UDAF like this: sqlContext.udf().register("myavg", new MyDoubleAvg()); My SQL query is: SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS `count_all`, count(nulli) AS `count_nulli` FROM mytable As soon as I add the UDAF myavg to the SQL, all the results become incorrect. When I remove the call to the UDAF, the results are correct. I was able to go around the issue by modifying bufferSchema of the UDAF to use an array and the corresponding update and merge methods. was: I am trying to incorporate the Java UDAF from https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java into an SQL query. I registered the UDAF like this: sqlContext.udf().register("myavg", new MyDoubleAvg()); My SQL query is: SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, count(nulli) AS `count_nulli` FROM mytable As soon as I add the UDAF myavg to the SQL, all the results become incorrect. When I remove the call to the UDAF, the results are correct. I was able to go around the issue by modifying bufferSchema of the UDAF to use an array and the corresponding update and merge methods. 
> Java UDAF with more than one intermediate argument returns wrong results > > > Key: SPARK-13680 > URL: https://issues.apache.org/jira/browse/SPARK-13680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.2 >Reporter: Yael Aharon > Attachments: data.csv > > > I am trying to incorporate the Java UDAF from > https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java > into an SQL query. > I registered the UDAF like this: > sqlContext.udf().register("myavg", new MyDoubleAvg()); > My SQL query is: > SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, > AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS > `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS > `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS > `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS > `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS > `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS > `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS > `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, > AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, > SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, > MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS > `cou
[jira] [Comment Edited] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180120#comment-15180120 ] Yael Aharon edited comment on SPARK-13680 at 3/4/16 4:42 PM: - I found this in the spark executor logs when running the MyDoubleAVG UDAF. Execution continued in spite of this exception: java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to java.lang.Long at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getLong(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply772_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:174) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:74) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) was (Author: yael): I found this in the spark executor logs when running the MyDoubleAVG UDAF. 
Execution continued in spite of this exception: java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to java.lang.Long at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getLong(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply772_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:174) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:74) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.
[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180120#comment-15180120 ] Yael Aharon commented on SPARK-13680: - I found this in the spark executor logs when running the MyDoubleAVG UDAF. Execution continued in spite of this exception: java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData cannot be cast to java.lang.Long at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41) at org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getLong(rows.scala:247) at org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply772_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:174) at org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139) at org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:74) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) > Java UDAF with more than one intermediate argument returns wrong results > > > Key: SPARK-13680 > URL: https://issues.apache.org/jira/browse/SPARK-13680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.2 >Reporter: Yael Aharon > Attachments: data.csv > > > I am trying to incorporate the Java UDAF from > https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java > into an SQL query. 
> I registered the UDAF like this: > sqlContext.udf().register("myavg", new MyDoubleAvg()); > My SQL query is: > SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, > AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS > `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS > `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS > `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS > `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS > `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS > `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS > `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, > AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, > SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, > MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS > `count_all`, count(nulli) AS `count_nulli` FROM mytable > As soon as I add the UDAF myavg to the SQL, all the results become incorrect. > When I remove the call to the UDAF, the results are correct. > I was able to go around the issue by modifying bufferSchema of the UDAF to > use an array and the corresponding update and merge methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
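For readers who hit the same bug, here is a minimal Scala sketch of the workaround described above: collapsing the multi-field intermediate buffer into a single array-typed column. It assumes a sum/count-style average similar to MyDoubleAvg; the class name ArrayBufferAvg and the column names are illustrative only, not part of the original report.
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Workaround shape: one ArrayType buffer column holding (sum, count) instead of
// two separate intermediate fields.
class ArrayBufferAvg extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)

  def bufferSchema: StructType =
    StructType(StructField("sumAndCount", ArrayType(DoubleType)) :: Nil)

  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq(0.0, 0.0)

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) {
      val acc = buffer.getSeq[Double](0)
      buffer(0) = Seq(acc(0) + input.getDouble(0), acc(1) + 1.0)
    }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val a = buffer1.getSeq[Double](0)
    val b = buffer2.getSeq[Double](0)
    buffer1(0) = Seq(a(0) + b(0), a(1) + b(1))
  }

  def evaluate(buffer: Row): Any = {
    val acc = buffer.getSeq[Double](0)
    if (acc(1) == 0.0) null else acc(0) / acc(1)
  }
}
{code}
Registration is unchanged, e.g. sqlContext.udf.register("myavg", new ArrayBufferAvg) from Scala, mirroring the Java registration shown above.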
[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216 ] Łukasz Gieroń commented on SPARK-13230: --- The issue here is a bug in the Scala library, in the deserialization of `HashMap1` objects. When they get deserialized, the internal `kv` field does not get deserialized (is left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in the Scala library, and it fixes the issue. I'm going to open a bug against the Scala library and submit a pull request for it, and link that ticket here (if it's possible to link between Jiras). > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. > {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > 
scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1
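To see the failure path outside Spark, the following self-contained Scala sketch round-trips a map through plain Java serialization (the same mechanism used for Spark task results) and then calls merged with the same merge function as MergeTest. Whether the NullPointerException actually fires depends on running a Scala version affected by the kv issue described above, so treat it as an illustration of the failure path rather than a guaranteed reproduction.
{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.immutable.HashMap

object MergedAfterDeserialization {
  // Plain Java serialization round trip.
  def roundTrip[T](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[T]
  }

  def main(args: Array[String]): Unit = {
    // A single-entry map is a HashMap.HashMap1 internally.
    val m1 = roundTrip(HashMap("A" -> 1L))
    val m2 = HashMap("A" -> 2L, "B" -> 3L)
    // On an affected Scala version, the merge function is handed a null pair for
    // the colliding key "A", which is where the NPE in the stack trace comes from.
    val merged = m1.merged(m2) { case (x, y) => (x._1, x._2 + y._2) }
    println(merged)
  }
}
{code}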
[jira] [Comment Edited] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216 ] Łukasz Gieroń edited comment on SPARK-13230 at 3/4/16 5:36 PM: --- The issue here is a bug in the Scala library, in the deserialization of `HashMap1` objects. When they get deserialized, the internal `kv` field does not get deserialized (is left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in the Scala library, and it fixes the issue. I'm going to open a bug against the Scala library and submit a pull request for it, and link that ticket here (if it's possible to link between Jiras). PS. Not sure why Jira doesn't recognize my backticks markdown. was (Author: lgieron): The issue here is a bug in the Scala library, in the deserialization of `HashMap1` objects. When they get deserialized, the internal `kv` field does not get deserialized (is left `null`), which causes a `NullPointerException` in `merged`. I've fixed this in the Scala library, and it fixes the issue. I'm going to open a bug against the Scala library and submit a pull request for it, and link that ticket here (if it's possible to link between Jiras). > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. > {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache
[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark
[ https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180234#comment-15180234 ] Sean Owen commented on SPARK-13230: --- Thanks, that's a great analysis. It sounds like we might need to close this as a Scala problem, and offer a workaround. For example, it's obviously possible to write a little function that accomplishes the same thing, and which I hope doesn't depend on serializing the same internal representation. (PS JIRA does not use markdown. Use pairs of curly braces to {{format as code}}. > HashMap.merged not working properly with Spark > -- > > Key: SPARK-13230 > URL: https://issues.apache.org/jira/browse/SPARK-13230 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0 >Reporter: Alin Treznai > > Using HashMap.merged with Spark fails with NullPointerException. > {noformat} > import org.apache.spark.{SparkConf, SparkContext} > import scala.collection.immutable.HashMap > object MergeTest { > def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => > HashMap[String, Long] = { > case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) } > } > def main(args: Array[String]) = { > val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> > 3L),HashMap("A" -> 2L, "C" -> 4L)) > val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]") > val sc = new SparkContext(conf) > val result = sc.parallelize(input).reduce(mergeFn) > println(s"Result=$result") > sc.stop() > } > } > {noformat} > Error message: > org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007) > at MergeTest$.main(MergeTest.scala:21) > at MergeTest.main(MergeTest.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > Caused by: java.lang.NullPointerException > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at > MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12) > at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148) > at > scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200) > at > scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322) > at > 
scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463) > at scala.collection.immutable.HashMap.merged(HashMap.scala:117) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12) > at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020) > at > org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017) > at > org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(
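In that spirit, here is a sketch of the kind of workaround suggested above: an equivalent of mergeFn from the report written against the plain Map API, which never calls merged and therefore never depends on how HashMap1's internals survive serialization. The name mergeFnSafe is illustrative.
{code}
import scala.collection.immutable.HashMap

// Fold the second map into the first, summing values for keys present in both.
def mergeFnSafe: (HashMap[String, Long], HashMap[String, Long]) => HashMap[String, Long] = {
  case (m1, m2) =>
    m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
{code}
Substituting this for mergeFn in the reduce call above should produce the same result while sidestepping the NullPointerException.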
[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
[ https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180241#comment-15180241 ] Joseph K. Bradley commented on SPARK-13048: --- I'd say the best fix would be to add an option to LDA to not delete the last checkpoint. I'd prefer to expose this as a Param in the spark.ml API, but it could be added to the spark.mllib API as well if necessary. [~holdenk] I agree we need to figure out how to handle/control caching and checkpointing within Pipelines, but that will have to wait for after 2.0. [~jvstein] We try to minimize the public API. Although I agree with you about opening up APIs in principle, it has proven dangerous in practice. Even when we mark things DeveloperApi, many users still use those APIs, making it difficult to change them in the future. > EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel > -- > > Key: SPARK-13048 > URL: https://issues.apache.org/jira/browse/SPARK-13048 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 > Environment: Standalone Spark cluster >Reporter: Jeff Stein > > In EMLDAOptimizer, all checkpoints are deleted before returning the > DistributedLDAModel. > The most recent checkpoint is still necessary for operations on the > DistributedLDAModel under a couple scenarios: > - The graph doesn't fit in memory on the worker nodes (e.g. very large data > set). > - Late worker failures that require reading the now-dependent checkpoint. > I ran into this problem running a 10M record LDA model in a memory starved > environment. The model consistently failed in either the {{collect at > LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the > {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the > model). In both cases, a FileNotFoundException is thrown attempting to access > a checkpoint file. > I'm not sure what the correct fix is here; it might involve a class signature > change. An alternative simple fix is to leave the last checkpoint around and > expect the user to clean the checkpoint directory themselves. > {noformat} > java.io.FileNotFoundException: File does not exist: > /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071 > {noformat} > Relevant code is included below. > LDAOptimizer.scala: > {noformat} > override private[clustering] def getLDAModel(iterationTimes: > Array[Double]): LDAModel = { > require(graph != null, "graph is null, EMLDAOptimizer not initialized.") > this.graphCheckpointer.deleteAllCheckpoints() > // The constructor's default arguments assume gammaShape = 100 to ensure > equivalence in > // LDAModel.toLocal conversion > new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, > this.vocabSize, > Vectors.dense(Array.fill(this.k)(this.docConcentration)), > this.topicConcentration, > iterationTimes) > } > {noformat} > PeriodicCheckpointer.scala > {noformat} > /** >* Call this at the end to delete any remaining checkpoint files. >*/ > def deleteAllCheckpoints(): Unit = { > while (checkpointQueue.nonEmpty) { > removeCheckpointFile() > } > } > /** >* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files. >* This prints a warning but does not fail if the files cannot be removed. >*/ > private def removeCheckpointFile(): Unit = { > val old = checkpointQueue.dequeue() > // Since the old checkpoint is not deleted by Spark, we manually delete > it. 
> val fs = FileSystem.get(sc.hadoopConfiguration) > getCheckpointFiles(old).foreach { checkpointFile => > try { > fs.delete(new Path(checkpointFile), true) > } catch { > case e: Exception => > logWarning("PeriodicCheckpointer could not remove old checkpoint > file: " + > checkpointFile) > } > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
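As a concrete sketch of the "keep the last checkpoint" option discussed above, the cleanup loop in PeriodicCheckpointer could simply stop one element early. The method name below and the idea of gating it behind a spark.ml Param are assumptions rather than existing Spark API; the sketch reuses the checkpointQueue and removeCheckpointFile members quoted above.
{code}
  /**
   * Hypothetical variant of deleteAllCheckpoints(): remove all but the most recent
   * checkpoint so that a returned DistributedLDAModel can still recover its
   * dependent RDDs. The caller (or a new Param) decides when to clean up the rest.
   */
  def deleteAllCheckpointsExceptLast(): Unit = {
    while (checkpointQueue.size > 1) {
      removeCheckpointFile()
    }
  }
{code}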
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180243#comment-15180243 ] Joseph K. Bradley commented on SPARK-13434: --- There are a few options here: * Temp fix: Reduce the number of executors, as you suggested. * Long-term for this RF implementation: Implement local training for deep trees. Spilling the current tree to disk would help, but I'd guess that local training would have a bigger impact. * Long-term fix via a separate RF implementation: I've been working for a long time on a column-partitioned implementation which will be better for tasks like yours with many features & deep trees. It's making progress but not yet ready to merge into Spark. > Reduce Spark RandomForest memory footprint > -- > > Key: SPARK-13434 > URL: https://issues.apache.org/jira/browse/SPARK-13434 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 > Environment: Linux >Reporter: Ewan Higgs > Labels: decisiontree, mllib, randomforest > Attachments: heap-usage.log, rf-heap-usage.png > > > The RandomForest implementation can easily run out of memory on moderate > datasets. This was raised in the a user's benchmarking game on github > (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there > was a tracking issue, but I couldn't fine one. > Using Spark 1.6, a user of mine is running into problems running the > RandomForest training on largish datasets on machines with 64G memory and the > following in {{spark-defaults.conf}}: > {code} > spark.executor.cores 2 > spark.executor.instances 199 > spark.executor.memory 10240M > {code} > I reproduced the excessive memory use from the benchmark example (using an > input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell > --driver-memory 30G --executor-memory 30G}} and have a heap profile from a > single machine by running {{jmap -histo:live }}. 
I took a sample > every 5 seconds and at the peak it looks like this: > {code} > num #instances #bytes class name > -- >1: 5428073 8458773496 [D >2: 12293653 4124641992 [I >3: 32508964 1820501984 org.apache.spark.mllib.tree.model.Node >4: 53068426 1698189632 org.apache.spark.mllib.tree.model.Predict >5: 72853787 1165660592 scala.Some >6: 16263408 910750848 > org.apache.spark.mllib.tree.model.InformationGainStats >7: 72969 390492744 [B >8: 3327008 133080320 > org.apache.spark.mllib.tree.impl.DTStatsAggregator >9: 3754500 120144000 > scala.collection.immutable.HashMap$HashMap1 > 10: 3318349 106187168 org.apache.spark.mllib.tree.model.Split > 11: 3534946 84838704 > org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo > 12: 3764745 60235920 java.lang.Integer > 13: 3327008 53232128 > org.apache.spark.mllib.tree.impurity.EntropyAggregator > 14:380804 45361144 [C > 15:268887 34877128 > 16:268887 34431568 > 17:908377 34042760 [Lscala.collection.immutable.HashMap; > 18: 110 2640 > org.apache.spark.mllib.regression.LabeledPoint > 19: 110 2640 org.apache.spark.mllib.linalg.SparseVector > 20: 20206 25979864 > 21: 100 2400 org.apache.spark.mllib.tree.impl.TreePoint > 22: 100 2400 > org.apache.spark.mllib.tree.impl.BaggedPoint > 23:908332 21799968 > scala.collection.immutable.HashMap$HashTrieMap > 24: 20206 20158864 > 25: 17023 14380352 > 26:16 13308288 > [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator; > 27:445797 10699128 scala.Tuple2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
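For the temporary fix (fewer executors), the usual move is to keep roughly the same total cluster memory while giving each executor a larger heap. The values below are purely illustrative against the configuration quoted in the report, not a recommendation.
{code}
import org.apache.spark.SparkConf

// Illustrative only: trade executor count for per-executor heap so tree
// aggregation has more headroom in each JVM.
val conf = new SparkConf()
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "50")   // report used 199
  .set("spark.executor.memory", "40g")     // report used 10240M
{code}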
[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results
[ https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yael Aharon updated SPARK-13680: Attachment: setup.hql > Java UDAF with more than one intermediate argument returns wrong results > > > Key: SPARK-13680 > URL: https://issues.apache.org/jira/browse/SPARK-13680 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: CDH 5.5.2 >Reporter: Yael Aharon > Attachments: data.csv, setup.hql > > > I am trying to incorporate the Java UDAF from > https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java > into an SQL query. > I registered the UDAF like this: > sqlContext.udf().register("myavg", new MyDoubleAvg()); > My SQL query is: > SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, > AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS > `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS > `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS > `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS > `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS > `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS > `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS > `sum_stdevi`, myavg(seqd) as `myavg_seqd`, AVG(zero) AS `avg_zero`, > AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, > SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, > MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS > `count_all`, count(nulli) AS `count_nulli` FROM mytable > As soon as I add the UDAF myavg to the SQL, all the results become incorrect. > When I remove the call to the UDAF, the results are correct. > I was able to go around the issue by modifying bufferSchema of the UDAF to > use an array and the corresponding update and merge methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13494) Cannot sort on a column which is of type "array"
[ https://issues.apache.org/jira/browse/SPARK-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180289#comment-15180289 ] Xiao Li commented on SPARK-13494: - Can you try one of the newer versions? A lot of issues have been fixed in each release. > Cannot sort on a column which is of type "array" > > > Key: SPARK-13494 > URL: https://issues.apache.org/jira/browse/SPARK-13494 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yael Aharon > > Executing the following SQL results in an error if columnName refers to a > column of type array > SELECT * FROM myTable ORDER BY columnName ASC LIMIT 50 > The error is > org.apache.spark.sql.AnalysisException: cannot resolve 'columnName ASC' due > to data type mismatch: cannot sort data type array -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package
[ https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-13633. --- Resolution: Fixed Fix Version/s: 2.0.0 > Move parser classes to o.a.s.sql.catalyst.parser package > > > Key: SPARK-13633 > URL: https://issues.apache.org/jira/browse/SPARK-13633 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13681) Reimplement CommitFailureTestRelationSuite
Michael Armbrust created SPARK-13681: Summary: Reimplement CommitFailureTestRelationSuite Key: SPARK-13681 URL: https://issues.apache.org/jira/browse/SPARK-13681 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13681) Reimplement CommitFailureTestRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-13681: - Description: This test case got broken by [#11509|https://github.com/apache/spark/pull/11509]. We should reimplement it as a format. > Reimplement CommitFailureTestRelationSuite > -- > > Key: SPARK-13681 > URL: https://issues.apache.org/jira/browse/SPARK-13681 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Michael Armbrust >Priority: Blocker > > This test case got broken by > [#11509|https://github.com/apache/spark/pull/11509]. We should reimplement > it as a format. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13682) Finalize the public API for FileFormat
Michael Armbrust created SPARK-13682: Summary: Finalize the public API for FileFormat Key: SPARK-13682 URL: https://issues.apache.org/jira/browse/SPARK-13682 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust The current file format interface needs to be cleaned up before it's acceptable for public consumption: - Have a version that takes Row and does a conversion, hide the internal API. - Remove bucketing - Remove RDD and the broadcastedConf - Remove SQLContext (maybe include SparkSession?) - Pass a better conf object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180427#comment-15180427 ] Marcelo Vanzin commented on SPARK-13670: After some fun playing with arcane bash syntax, here's something that worked for me: {code} run_command() { CMD=() while IFS='' read -d '' -r ARG; do echo "line: $ARG" CMD+=("$ARG") done if [ ${#CMD[@]} -gt 0 ]; then exec "${CMD[@]}" fi } set -o pipefail "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | run_command {code} Example: {noformat} $ ./bin/spark-shell NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes ahead of assembly. Exception in thread "main" java.lang.IllegalArgumentException: Testing, testing, testing... at org.apache.spark.launcher.Main.main(Main.java:93) $ echo $? 1 {noformat} > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > There's a particular snippet in spark-class > [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell. > {code} > # The launcher library will print arguments separated by a NULL character, to > allow arguments with > # characters that would be otherwise interpreted by the shell. Read that in a > while loop, populating > # an array that will be used to exec the final command. > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main > "$@") > {code} > The problem is that the if the launcher Main fails, this code still still > returns success and continues, even though the top level script is marked > {{set -e}}. This is because the launcher.Main is run within a subshell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180444#comment-15180444 ] Marcelo Vanzin commented on SPARK-13670: Note that will probably leave a bash process running somewhere alongside the Spark jvm, so probably would need tweaks to avoid that... bash is fun. > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > There's a particular snippet in spark-class > [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell. > {code} > # The launcher library will print arguments separated by a NULL character, to > allow arguments with > # characters that would be otherwise interpreted by the shell. Read that in a > while loop, populating > # an array that will be used to exec the final command. > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main > "$@") > {code} > The problem is that the if the launcher Main fails, this code still still > returns success and continues, even though the top level script is marked > {{set -e}}. This is because the launcher.Main is run within a subshell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command
[ https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180455#comment-15180455 ] Marcelo Vanzin commented on SPARK-13670: Actually scrap that, it breaks things when the spark-shell actually runs... back to the drawing board. > spark-class doesn't bubble up error from launcher command > - > > Key: SPARK-13670 > URL: https://issues.apache.org/jira/browse/SPARK-13670 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 >Reporter: Mark Grover >Priority: Minor > > There's a particular snippet in spark-class > [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that > runs the spark-launcher code in a subshell. > {code} > # The launcher library will print arguments separated by a NULL character, to > allow arguments with > # characters that would be otherwise interpreted by the shell. Read that in a > while loop, populating > # an array that will be used to exec the final command. > CMD=() > while IFS= read -d '' -r ARG; do > CMD+=("$ARG") > done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main > "$@") > {code} > The problem is that the if the launcher Main fails, this code still still > returns success and continues, even though the top level script is marked > {{set -e}}. This is because the launcher.Main is run within a subshell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org