[jira] [Created] (SPARK-13671) Use different physical plan for existing RDD and data sources

2016-03-04 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13671:
--

 Summary: Use different physical plan for existing RDD and data 
sources
 Key: SPARK-13671
 URL: https://issues.apache.org/jira/browse/SPARK-13671
 Project: Spark
  Issue Type: Task
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Right now, we use PhysicalRDD for both existing RDDs and data sources, but the two 
cases have diverged significantly, so we should use separate physical plans for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10610) Using AppName instead of AppId in the name of all metrics

2016-03-04 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179507#comment-15179507
 ] 

Pete Robbins edited comment on SPARK-10610 at 3/4/16 8:01 AM:
--

I think the appId is an important piece of information when visualizing the 
metrics along with hostname, executorId, etc. I'm writing a sink and reporter to 
push the metrics to Elasticsearch, and I include these in the metrics types for 
better correlation, e.g.

{
  "timestamp": "2016-03-03T15:58:31.903+",
  "hostName": "9.20.187.127",
  "applicationId": "app-20160303155742-0005",
  "executorId": "driver",
  "BlockManager_memory_maxMem_MB": 3933
}

The appId and executorId I extract from the metric name. When the sink is 
instantiated I don't believe I have access to any Utils to obtain the current 
appId and executorId, so I'm relying on these being in the metric name 
for the moment.

Is it possible to make appId, applicationName, and executorId available to me via 
some Utils function that I have access to in a metrics Sink?

I guess I'm asking: How can I get hold of the SparkConf if I've not been passed 
it?
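
As an illustration of the metric-name parsing described above, here is a minimal 
sketch. It assumes the default {{appId.executorId.source}} naming convention, and the 
helper is hypothetical, not a Spark API.

{code}
// Hypothetical helper: split a metric name such as
// "app-20160303155742-0005.driver.BlockManager.memory.maxMem_MB"
// into (applicationId, executorId, remainder of the metric name).
// Relies on the default "appId.executorId.source" naming convention.
def splitMetricName(name: String): (String, String, String) =
  name.split("\\.", 3) match {
    case Array(appId, executorId, rest) => (appId, executorId, rest)
    case _                              => ("", "", name) // unexpected format
  }

// splitMetricName("app-20160303155742-0005.driver.BlockManager.memory.maxMem_MB")
// => ("app-20160303155742-0005", "driver", "BlockManager.memory.maxMem_MB")
{code}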


was (Author: robbinspg):
I think the appId is an important piece of information when visualizing the 
metrics along with hostname, executorId etc. I'm writing a sink and reporter to 
push the metrics to Elasticsearch and I include these in the metrics types for 
better correlation. eg

{
"timestamp": "2016-03-03T15:58:31.903+",
"hostName": "9.20.187.127"
"applicationId": "app-20160303155742-0005",
"executorId": "driver",
"BlockManager_memory_maxMem_MB": 3933
  }

The appId and executorId I extract form the metric name. When the sink is 
instantiated I don't believe I have access to any Utils to obtain the current 
appId and executorId so I'm kind of relying on these being in the metric name 
for the moment.

Is it possible to make appId, applicationName, executorId avaiable to me via 
some Utils function that I have access to in a metrics Sink?

> Using AppName instead of AppId in the name of all metrics
> -
>
> Key: SPARK-10610
> URL: https://issues.apache.org/jira/browse/SPARK-10610
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Yi Tian
>Priority: Minor
>
> When we use {{JMX}} to monitor the Spark system, we have to configure the names 
> of the target metrics in the monitoring system. But the current metric name is 
> {{appId}} + {{executorId}} + {{source}}, so when the Spark program is 
> restarted, we have to update the metric names in the monitoring system.
> We should add an optional configuration property to control whether to use the 
> appName instead of the appId in the Spark metrics system.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13671) Use different physical plan for existing RDD and data sources

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179538#comment-15179538
 ] 

Apache Spark commented on SPARK-13671:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11514

> Use different physical plan for existing RDD and data sources
> -
>
> Key: SPARK-13671
> URL: https://issues.apache.org/jira/browse/SPARK-13671
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we use PhysicalRDD for both existing RDDs and data sources, but the 
> two cases have diverged significantly, so we should use separate physical plans for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13671) Use different physical plan for existing RDD and data sources

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13671:


Assignee: Apache Spark  (was: Davies Liu)

> Use different physical plan for existing RDD and data sources
> -
>
> Key: SPARK-13671
> URL: https://issues.apache.org/jira/browse/SPARK-13671
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, we use PhysicalRDD for both existing RDDs and data sources, but the 
> two cases have diverged significantly, so we should use separate physical plans for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13671) Use different physical plan for existing RDD and data sources

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13671:


Assignee: Davies Liu  (was: Apache Spark)

> Use different physical plan for existing RDD and data sources
> -
>
> Key: SPARK-13671
> URL: https://issues.apache.org/jira/browse/SPARK-13671
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we use PhysicalRDD for both existing RDDs and data sources, but the 
> two cases have diverged significantly, so we should use separate physical plans for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13603) SQL generation for subquery

2016-03-04 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-13603.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11453
[https://github.com/apache/spark/pull/11453]

> SQL generation for subquery
> ---
>
> Key: SPARK-13603
> URL: https://issues.apache.org/jira/browse/SPARK-13603
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> Generate SQL for subquery expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB

2016-03-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13672:


 Summary: Add python examples of BisectingKMeans in ML and MLLIB
 Key: SPARK-13672
 URL: https://issues.apache.org/jira/browse/SPARK-13672
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: zhengruifeng
Priority: Trivial


add the missing python examples of BisectingKMeans for ml and mllib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13672:


Assignee: (was: Apache Spark)

> Add python examples of BisectingKMeans in ML and MLLIB
> --
>
> Key: SPARK-13672
> URL: https://issues.apache.org/jira/browse/SPARK-13672
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
>Priority: Trivial
>
> add the missing python examples of BisectingKMeans for ml and mllib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13672:


Assignee: Apache Spark

> Add python examples of BisectingKMeans in ML and MLLIB
> --
>
> Key: SPARK-13672
> URL: https://issues.apache.org/jira/browse/SPARK-13672
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> add the missing python examples of BisectingKMeans for ml and mllib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13672) Add python examples of BisectingKMeans in ML and MLLIB

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179555#comment-15179555
 ] 

Apache Spark commented on SPARK-13672:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11515

> Add python examples of BisectingKMeans in ML and MLLIB
> --
>
> Key: SPARK-13672
> URL: https://issues.apache.org/jira/browse/SPARK-13672
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: zhengruifeng
>Priority: Trivial
>
> add the missing python examples of BisectingKMeans for ml and mllib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.

2016-03-04 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-13673:
-

 Summary: script bin\beeline.cmd pollutes environment variables in 
Windows.
 Key: SPARK-13673
 URL: https://issues.apache.org/jira/browse/SPARK-13673
 Project: Spark
  Issue Type: Improvement
  Components: Windows
Affects Versions: 1.6.0
 Environment: Windows 8.1
Reporter: Masayoshi TSUZUKI
Priority: Minor


{{bin\beeline.cmd}} pollutes environment variables in Windows.
A similar problem was reported and fixed in [SPARK-3943], but 
{{bin\beeline.cmd}} was added later.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13629) Add binary toggle Param to CountVectorizer

2016-03-04 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179571#comment-15179571
 ] 

Nick Pentreath commented on SPARK-13629:


Only the word count would be set to 1 (for non-zero count).
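
Below is a rough workaround sketch for versions that do not yet have such a Param: 
clamp the counts produced by CountVectorizer to 1.0 with a UDF. It assumes 
Spark 1.6-style {{org.apache.spark.mllib.linalg}} vectors, and the column names in 
the usage comment are placeholders.

{code}
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Map every non-zero term count to 1.0, leaving absent (zero) entries alone.
val binarize = udf { v: Vector =>
  v match {
    case sv: SparseVector =>
      Vectors.sparse(sv.size, sv.indices, Array.fill(sv.indices.length)(1.0))
    case dv =>
      Vectors.dense(dv.toArray.map(x => if (x != 0.0) 1.0 else 0.0))
  }
}

// e.g. given a DataFrame "vectorized" with a CountVectorizer output column "features":
// val withBinary = vectorized.withColumn("binaryFeatures", binarize(col("features")))
{code}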

> Add binary toggle Param to CountVectorizer
> --
>
> Key: SPARK-13629
> URL: https://issues.apache.org/jira/browse/SPARK-13629
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> It would be handy to add a binary toggle Param to CountVectorizer, as in the 
> scikit-learn one: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html]
> If set, then all non-zero counts will be set to 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13673:


Assignee: (was: Apache Spark)

> script bin\beeline.cmd pollutes environment variables in Windows.
> -
>
> Key: SPARK-13673
> URL: https://issues.apache.org/jira/browse/SPARK-13673
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.6.0
> Environment: Windows 8.1
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> {{bin\beeline.cmd}} pollutes environment variables in Windows.
> A similar problem was reported and fixed in [SPARK-3943], but 
> {{bin\beeline.cmd}} was added later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179578#comment-15179578
 ] 

Apache Spark commented on SPARK-13673:
--

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/11516

> script bin\beeline.cmd pollutes environment variables in Windows.
> -
>
> Key: SPARK-13673
> URL: https://issues.apache.org/jira/browse/SPARK-13673
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.6.0
> Environment: Windows 8.1
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> {{bin\beeline.cmd}} pollutes environment variables in Windows.
> A similar problem was reported and fixed in [SPARK-3943], but 
> {{bin\beeline.cmd}} was added later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13673:


Assignee: Apache Spark

> script bin\beeline.cmd pollutes environment variables in Windows.
> -
>
> Key: SPARK-13673
> URL: https://issues.apache.org/jira/browse/SPARK-13673
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.6.0
> Environment: Windows 8.1
>Reporter: Masayoshi TSUZUKI
>Assignee: Apache Spark
>Priority: Minor
>
> {{bin\beeline.cmd}} pollutes environment variables in Windows.
> A similar problem was reported and fixed in [SPARK-3943], but 
> {{bin\beeline.cmd}} was added later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-13652) TransportClient.sendRpcSync returns wrong results

2016-03-04 Thread huangyu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangyu closed SPARK-13652.
---

This issue has been fixed by Shixiong Zhu

> TransportClient.sendRpcSync returns wrong results
> -
>
> Key: SPARK-13652
> URL: https://issues.apache.org/jira/browse/SPARK-13652
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: huangyu
>Assignee: Shixiong Zhu
> Fix For: 1.6.2, 2.0.0
>
> Attachments: RankHandler.java, Test.java
>
>
> TransportClient is not thread safe, and if it is called from multiple threads 
> the messages can't be encoded and decoded correctly. Below is my code, and it 
> will print wrong messages.
> {code}
> public static void main(String[] args) throws IOException, InterruptedException {
>     TransportServer server = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new RankHandler())
>             .createServer(8081, new LinkedList<TransportServerBootstrap>());
>     TransportContext context = new TransportContext(
>             new TransportConf("test", new MapConfigProvider(new HashMap<String, String>())),
>             new NoOpRpcHandler(), true);
>     final TransportClientFactory clientFactory = context.createClientFactory();
>     List<Thread> ts = new ArrayList<>();
>     for (int i = 0; i < 10; i++) {
>         // ten threads concurrently send RPCs through clients obtained from the same factory
>         ts.add(new Thread(new Runnable() {
>             @Override
>             public void run() {
>                 for (int j = 0; j < 1000; j++) {
>                     try {
>                         ByteBuf buf = Unpooled.buffer(8);
>                         buf.writeLong((long) j);
>                         ByteBuffer byteBuffer = clientFactory.createClient("localhost", 8081)
>                                 .sendRpcSync(buf.nioBuffer(), Long.MAX_VALUE);
>                         long response = byteBuffer.getLong();
>                         if (response != j) {
>                             System.err.println("send:" + j + ",response:" + response);
>                         }
>                     } catch (IOException e) {
>                         e.printStackTrace();
>                     }
>                 }
>             }
>         }));
>         ts.get(i).start();
>     }
>     for (Thread t : ts) {
>         t.join();
>     }
>     server.close();
> }
>
> public class RankHandler extends RpcHandler {
>     private final Logger logger = LoggerFactory.getLogger(RankHandler.class);
>     private final StreamManager streamManager;
>
>     public RankHandler() {
>         this.streamManager = new OneForOneStreamManager();
>     }
>
>     @Override
>     public void receive(TransportClient client, ByteBuffer msg, RpcResponseCallback callback) {
>         // simply echo the request payload back to the caller
>         callback.onSuccess(msg);
>     }
>
>     @Override
>     public StreamManager getStreamManager() {
>         return streamManager;
>     }
> }
> {code}
> It will print mismatched pairs like the following:
> send:221,response:222
> send:233,response:234
> send:312,response:313
> send:358,response:359
> ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179593#comment-15179593
 ] 

Sean Owen commented on SPARK-13663:
---

OK to update for master/1.6

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
>
> The JVM memory leak reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13674) Add wholestage codegen support to Sample

2016-03-04 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-13674:
---

 Summary: Add wholestage codegen support to Sample
 Key: SPARK-13674
 URL: https://issues.apache.org/jira/browse/SPARK-13674
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


Sample operator doesn't support wholestage codegen now. This issue is opened to 
add support for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13674) Add wholestage codegen support to Sample

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13674:


Assignee: (was: Apache Spark)

> Add wholestage codegen support to Sample
> 
>
> Key: SPARK-13674
> URL: https://issues.apache.org/jira/browse/SPARK-13674
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Sample operator doesn't support wholestage codegen now. This issue is opened 
> to add support for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13674) Add wholestage codegen support to Sample

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13674:


Assignee: Apache Spark

> Add wholestage codegen support to Sample
> 
>
> Key: SPARK-13674
> URL: https://issues.apache.org/jira/browse/SPARK-13674
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Sample operator doesn't support wholestage codegen now. This issue is opened 
> to add support for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13674) Add wholestage codegen support to Sample

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179639#comment-15179639
 ] 

Apache Spark commented on SPARK-13674:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/11517

> Add wholestage codegen support to Sample
> 
>
> Key: SPARK-13674
> URL: https://issues.apache.org/jira/browse/SPARK-13674
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Sample operator doesn't support wholestage codegen now. This issue is opened 
> to add support for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13646) QuantileDiscretizer counts dataset twice in getSampledInput

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13646.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11491
[https://github.com/apache/spark/pull/11491]

> QuantileDiscretizer counts dataset twice in getSampledInput
> ---
>
> Key: SPARK-13646
> URL: https://issues.apache.org/jira/browse/SPARK-13646
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Abou Haydar Elias
>Priority: Trivial
>  Labels: patch, performance
> Fix For: 2.0.0
>
>
> getSampledInput counts the dataset twice as you see here : 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L116]
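
As an illustration of the fix pattern (a sketch only, not the actual patch; the 
helper and parameter names are made up): count the dataset once and reuse the 
result when deriving the sample fraction.

{code}
import org.apache.spark.rdd.RDD

// Count the dataset a single time and reuse the result for the sample fraction,
// instead of triggering two count() jobs over the same data.
def sampleForQuantiles[T](data: RDD[T], requiredSamples: Int, seed: Long): RDD[T] = {
  val total = data.count()                                      // one pass
  val fraction = math.min(requiredSamples.toDouble / total, 1.0)
  data.sample(withReplacement = false, fraction, seed)
}
{code}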



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13646) QuantileDiscretizer counts dataset twice in getSampledInput

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13646:
--
Assignee: Abou Haydar Elias

> QuantileDiscretizer counts dataset twice in getSampledInput
> ---
>
> Key: SPARK-13646
> URL: https://issues.apache.org/jira/browse/SPARK-13646
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Abou Haydar Elias
>Assignee: Abou Haydar Elias
>Priority: Trivial
>  Labels: patch, performance
> Fix For: 2.0.0
>
>
> getSampledInput counts the dataset twice as you see here : 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala#L116]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-13675:
---

 Summary: The url link in historypage is not correct for 
application running in yarn cluster mode
 Key: SPARK-13675
 URL: https://issues.apache.org/jira/browse/SPARK-13675
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Saisai Shao


Current URL for each application to access history UI is like: 
http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
http://localhost:18080/history/application_1457058760338_0016/2/jobs/

Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it 
will parse to attempt id in {{HistoryServer}}, while the correct attempt id 
should be like "appattempt_1457058760338_0016_02", so it will failed to 
parse to a correct attempt id in {{HistoryServer}}.

This is OK in yarn client mode, since we don't need this attempt id to fetch 
out the app cache, but it is failed in yarn cluster mode, where attempt id "1" 
or "2" is actually wrong.

So here we should fix this url to parse the correct application id and attempt 
id.

This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13675:

Description: 
Current URL for each application to access history UI is like: 
http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
http://localhost:18080/history/application_1457058760338_0016/2/jobs/

Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it 
will parse to attempt id in {{HistoryServer}}, while the correct attempt id 
should be like "appattempt_1457058760338_0016_02", so it will fail to parse 
to a correct attempt id in {{HistoryServer}}.

This is OK in yarn client mode, since we don't need this attempt id to fetch 
out the app cache, but it is failed in yarn cluster mode, where attempt id "1" 
or "2" is actually wrong.

So here we should fix this url to parse the correct application id and attempt 
id.

This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.

  was:
Current URL for each application to access history UI is like: 
http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
http://localhost:18080/history/application_1457058760338_0016/2/jobs/

Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it 
will parse to attempt id in {{HistoryServer}}, while the correct attempt id 
should be like "appattempt_1457058760338_0016_02", so it will failed to 
parse to a correct attempt id in {{HistoryServer}}.

This is OK in yarn client mode, since we don't need this attempt id to fetch 
out the app cache, but it is failed in yarn cluster mode, where attempt id "1" 
or "2" is actually wrong.

So here we should fix this url to parse the correct application id and attempt 
id.

This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.


> The url link in historypage is not correct for application running in yarn 
> cluster mode
> ---
>
> Key: SPARK-13675
> URL: https://issues.apache.org/jira/browse/SPARK-13675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>
> Current URL for each application to access history UI is like: 
> http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
> http://localhost:18080/history/application_1457058760338_0016/2/jobs/
> Here *1* or *2* represents the number of attempts in {{historypage.js}}, but 
> it will parse to attempt id in {{HistoryServer}}, while the correct attempt 
> id should be like "appattempt_1457058760338_0016_02", so it will fail to 
> parse to a correct attempt id in {{HistoryServer}}.
> This is OK in yarn client mode, since we don't need this attempt id to fetch 
> out the app cache, but it is failed in yarn cluster mode, where attempt id 
> "1" or "2" is actually wrong.
> So here we should fix this url to parse the correct application id and 
> attempt id.
> This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13675:

Description: 
Current URL for each application to access history UI is like: 
http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
http://localhost:18080/history/application_1457058760338_0016/2/jobs/

Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it 
will parse to attempt id in {{HistoryServer}}, while the correct attempt id 
should be like "appattempt_1457058760338_0016_02", so it will fail to parse 
to a correct attempt id in {{HistoryServer}}.

This is OK in yarn client mode, since we don't need this attempt id to fetch 
out the app cache, but it is failed in yarn cluster mode, where attempt id "1" 
or "2" is actually wrong.

So here we should fix this url to parse the correct application id and attempt 
id.

This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.

Here is the screenshot:

!https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png!

  was:
Current URL for each application to access history UI is like: 
http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
http://localhost:18080/history/application_1457058760338_0016/2/jobs/

Here *1* or *2* represents the number of attempts in {{historypage.js}}, but it 
will parse to attempt id in {{HistoryServer}}, while the correct attempt id 
should be like "appattempt_1457058760338_0016_02", so it will fail to parse 
to a correct attempt id in {{HistoryServer}}.

This is OK in yarn client mode, since we don't need this attempt id to fetch 
out the app cache, but it is failed in yarn cluster mode, where attempt id "1" 
or "2" is actually wrong.

So here we should fix this url to parse the correct application id and attempt 
id.

This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.


> The url link in historypage is not correct for application running in yarn 
> cluster mode
> ---
>
> Key: SPARK-13675
> URL: https://issues.apache.org/jira/browse/SPARK-13675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
> Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png
>
>
> Current URL for each application to access history UI is like: 
> http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
> http://localhost:18080/history/application_1457058760338_0016/2/jobs/
> Here *1* or *2* represents the number of attempts in {{historypage.js}}, but 
> it will parse to attempt id in {{HistoryServer}}, while the correct attempt 
> id should be like "appattempt_1457058760338_0016_02", so it will fail to 
> parse to a correct attempt id in {{HistoryServer}}.
> This is OK in yarn client mode, since we don't need this attempt id to fetch 
> out the app cache, but it is failed in yarn cluster mode, where attempt id 
> "1" or "2" is actually wrong.
> So here we should fix this url to parse the correct application id and 
> attempt id.
> This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.
> Here is the screenshot:
> !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-13675:

Attachment: Screen Shot 2016-02-29 at 3.57.32 PM.png

> The url link in historypage is not correct for application running in yarn 
> cluster mode
> ---
>
> Key: SPARK-13675
> URL: https://issues.apache.org/jira/browse/SPARK-13675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
> Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png
>
>
> Current URL for each application to access history UI is like: 
> http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
> http://localhost:18080/history/application_1457058760338_0016/2/jobs/
> Here *1* or *2* represents the number of attempts in {{historypage.js}}, but 
> it will parse to attempt id in {{HistoryServer}}, while the correct attempt 
> id should be like "appattempt_1457058760338_0016_02", so it will fail to 
> parse to a correct attempt id in {{HistoryServer}}.
> This is OK in yarn client mode, since we don't need this attempt id to fetch 
> out the app cache, but it is failed in yarn cluster mode, where attempt id 
> "1" or "2" is actually wrong.
> So here we should fix this url to parse the correct application id and 
> attempt id.
> This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13675:


Assignee: (was: Apache Spark)

> The url link in historypage is not correct for application running in yarn 
> cluster mode
> ---
>
> Key: SPARK-13675
> URL: https://issues.apache.org/jira/browse/SPARK-13675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
> Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png
>
>
> Current URL for each application to access history UI is like: 
> http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
> http://localhost:18080/history/application_1457058760338_0016/2/jobs/
> Here *1* or *2* represents the number of attempts in {{historypage.js}}, but 
> it will parse to attempt id in {{HistoryServer}}, while the correct attempt 
> id should be like "appattempt_1457058760338_0016_02", so it will fail to 
> parse to a correct attempt id in {{HistoryServer}}.
> This is OK in yarn client mode, since we don't need this attempt id to fetch 
> out the app cache, but it is failed in yarn cluster mode, where attempt id 
> "1" or "2" is actually wrong.
> So here we should fix this url to parse the correct application id and 
> attempt id.
> This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.
> Here is the screenshot:
> !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13675:


Assignee: Apache Spark

> The url link in historypage is not correct for application running in yarn 
> cluster mode
> ---
>
> Key: SPARK-13675
> URL: https://issues.apache.org/jira/browse/SPARK-13675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
> Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png
>
>
> Current URL for each application to access history UI is like: 
> http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
> http://localhost:18080/history/application_1457058760338_0016/2/jobs/
> Here *1* or *2* represents the number of attempts in {{historypage.js}}, but 
> it will parse to attempt id in {{HistoryServer}}, while the correct attempt 
> id should be like "appattempt_1457058760338_0016_02", so it will fail to 
> parse to a correct attempt id in {{HistoryServer}}.
> This is OK in yarn client mode, since we don't need this attempt id to fetch 
> out the app cache, but it is failed in yarn cluster mode, where attempt id 
> "1" or "2" is actually wrong.
> So here we should fix this url to parse the correct application id and 
> attempt id.
> This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.
> Here is the screenshot:
> !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13675) The url link in historypage is not correct for application running in yarn cluster mode

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179683#comment-15179683
 ] 

Apache Spark commented on SPARK-13675:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11518

> The url link in historypage is not correct for application running in yarn 
> cluster mode
> ---
>
> Key: SPARK-13675
> URL: https://issues.apache.org/jira/browse/SPARK-13675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
> Attachments: Screen Shot 2016-02-29 at 3.57.32 PM.png
>
>
> Current URL for each application to access history UI is like: 
> http://localhost:18080/history/application_1457058760338_0016/1/jobs/ or 
> http://localhost:18080/history/application_1457058760338_0016/2/jobs/
> Here *1* or *2* represents the number of attempts in {{historypage.js}}, but 
> it will parse to attempt id in {{HistoryServer}}, while the correct attempt 
> id should be like "appattempt_1457058760338_0016_02", so it will fail to 
> parse to a correct attempt id in {{HistoryServer}}.
> This is OK in yarn client mode, since we don't need this attempt id to fetch 
> out the app cache, but it is failed in yarn cluster mode, where attempt id 
> "1" or "2" is actually wrong.
> So here we should fix this url to parse the correct application id and 
> attempt id.
> This bug is newly introduced in SPARK-10873, there's no issue in branch 1.6.
> Here is the screenshot:
> !https://issues.apache.org/jira/secure/attachment/12791437/Screen%20Shot%202016-02-29%20at%203.57.32%20PM.png!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13398) Move away from deprecated ThreadPoolTaskSupport

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13398.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11423
[https://github.com/apache/spark/pull/11423]

> Move away from deprecated ThreadPoolTaskSupport
> ---
>
> Key: SPARK-13398
> URL: https://issues.apache.org/jira/browse/SPARK-13398
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> ThreadPoolTaskSupport has been replaced by ForkJoinTaskSupport



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13398) Move away from deprecated ThreadPoolTaskSupport

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13398:
--
Assignee: holdenk

> Move away from deprecated ThreadPoolTaskSupport
> ---
>
> Key: SPARK-13398
> URL: https://issues.apache.org/jira/browse/SPARK-13398
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> ThreadPoolTaskSupport has been replaced by ForkJoinTaskSupport



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12925) Improve HiveInspectors.unwrap for StringObjectInspector.getPrimitiveWritableObject

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12925:
--
Priority: Minor  (was: Major)

> Improve HiveInspectors.unwrap for 
> StringObjectInspector.getPrimitiveWritableObject
> --
>
> Key: SPARK-12925
> URL: https://issues.apache.org/jira/browse/SPARK-12925
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: SPARK-12925_profiler_cpu_samples.png
>
>
> Text is in UTF-8 and converting it via "UTF8String.fromString" incurs 
> decoding and encoding, which turns out to be expensive. (to be specific: 
> https://github.com/apache/spark/blob/0d543b98f3e3da5053f0476f4647a765460861f3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L323)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-13676:
-

 Summary: Fix mismatched default values for regParam in 
LogisticRegression
 Key: SPARK-13676
 URL: https://issues.apache.org/jira/browse/SPARK-13676
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Dongjoon Hyun


The default value of the regularization parameter for the `LogisticRegression` 
algorithm is different in Scala and Python. We should provide the same value.

{code:title=Scala|borderStyle=solid}
scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
{code}

{code:title=Python|borderStyle=solid}
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
{code}
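
Until the defaults are aligned, a workaround sketch (not the fix itself) is to set 
the parameter explicitly rather than rely on the default; shown here on the Scala 
side, the Python setter is analogous.

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Pin regParam explicitly so Scala and Python pipelines agree,
// rather than depending on the (currently mismatched) defaults.
val lr = new LogisticRegression().setRegParam(0.0)
println(lr.getRegParam)  // 0.0
{code}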



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-13676:
--
Component/s: MLlib

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
>
> The default value of the regularization parameter for the `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179792#comment-15179792
 ] 

Apache Spark commented on SPARK-13676:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/11519

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
>
> The default value of the regularization parameter for the `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13676:


Assignee: (was: Apache Spark)

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
>
> The default value of the regularization parameter for the `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13676:


Assignee: Apache Spark

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> The default value of the regularization parameter for the `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179796#comment-15179796
 ] 

Sean Owen commented on SPARK-13596:
---

[~nchammas] do you happen to know how we can configure stuff to expect 
{{tox.ini}} in the {{python}} directory instead? I'm trying to clean up the top 
level.

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to be there, like {{README.md}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13677) Support Tree-Based Feature Transformation for mllib

2016-03-04 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-13677:


 Summary: Support Tree-Based Feature Transformation for mllib
 Key: SPARK-13677
 URL: https://issues.apache.org/jira/browse/SPARK-13677
 Project: Spark
  Issue Type: New Feature
Reporter: zhengruifeng
Priority: Minor


It would be nice to be able to use RF and GBT for feature transformation:
First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
the training set. Then each leaf of each tree in the ensemble is assigned a 
fixed arbitrary feature index in a new feature space. These leaf indices are 
then encoded in a one-hot fashion.

This method was first introduced by Facebook 
(http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
implemented in two well-known libraries:
sklearn 
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
xgboost 
(https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)

I have implemented it in MLlib:

val features : RDD[Vector] = ...
val model1 : RandomForestModel = ...
val transformed1 : RDD[Vector] = model1.leaf(features)

val model2 : GradientBoostedTreesModel = ...
val transformed2 : RDD[Vector] = model2.leaf(features)
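
To make the one-hot encoding step concrete, here is a small illustrative sketch; the 
helper below is made up for this note and is not the API proposed above.

{code}
// Given the leaf index reached in each tree and the number of leaves per tree,
// build the concatenated one-hot feature vector described above.
def oneHotLeaves(leafIndices: Array[Int], leavesPerTree: Array[Int]): Array[Double] = {
  require(leafIndices.length == leavesPerTree.length)
  val offsets = leavesPerTree.scanLeft(0)(_ + _)   // start offset of each tree's block
  val encoded = Array.fill(leavesPerTree.sum)(0.0)
  leafIndices.zipWithIndex.foreach { case (leaf, tree) =>
    encoded(offsets(tree) + leaf) = 1.0
  }
  encoded
}

// 3 trees with 4, 3 and 5 leaves; a sample lands in leaves 2, 0 and 4:
// oneHotLeaves(Array(2, 0, 4), Array(4, 3, 5))
// => [0,0,1,0, 1,0,0, 0,0,0,0,1]
{code}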





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13677) Support Tree-Based Feature Transformation for mllib

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13677:


Assignee: Apache Spark

> Support Tree-Based Feature Transformation for mllib
> ---
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by Facebook 
> (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in MLlib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13677) Support Tree-Based Feature Transformation for mllib

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179827#comment-15179827
 ] 

Apache Spark commented on SPARK-13677:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/11520

> Support Tree-Based Feature Transformation for mllib
> ---
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by Facebook 
> (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in MLlib:
> val features : RDD[Vector] = ...
> val model1 : RandomForestModel = ...
> val transformed1 : RDD[Vector] = model1.leaf(features)
> val model2 : GradientBoostedTreesModel = ...
> val transformed2 : RDD[Vector] = model2.leaf(features)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13677) Support Tree-Based Feature Transformation for mllib

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13677:


Assignee: (was: Apache Spark)

> Support Tree-Based Feature Transformation for mllib
> ---
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>Reporter: zhengruifeng
>Priority: Minor
>
> It would be nice to be able to use RF and GBT for feature transformation:
> First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook (http://www.herbrich.me/papers/adclicksfacebook.pdf), and is 
> implemented in two well-known libraries:
> sklearn 
> (http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py)
> xgboost 
> (https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py)
> I have implemented it in MLlib:
> val features: RDD[Vector] = ...
> val model1: RandomForestModel = ...
> val transformed1: RDD[Vector] = model1.leaf(features)
> val model2: GradientBoostedTreesModel = ...
> val transformed2: RDD[Vector] = model2.leaf(features)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13489) GSoC 2016 project ideas for MLlib

2016-03-04 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179833#comment-15179833
 ] 

Kai Jiang commented on SPARK-13489:
---

[~josephkb] Thanks for your explanation! It seems like there are lots of 
missing models in SparkR. I opened a Google doc 
([link|https://docs.google.com/document/d/15h1IbuGJMQvqCU7kALZ4Qr6tZPPqI2hgXTYnJPIiFXg/edit?usp=sharing])
 and put some ideas into it. Do you mind giving some suggestions on whether 
those ideas are suitable for a GSoC project? cc [~mengxr] [~mlnick]

> GSoC 2016 project ideas for MLlib
> -
>
> Key: SPARK-13489
> URL: https://issues.apache.org/jira/browse/SPARK-13489
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> I want to use this JIRA to collect some GSoC project ideas for MLlib. 
> Ideally, the student should have contributed to Spark. And the content of the 
> project could be divided into small functional pieces so that it won't get 
> stalled if the mentor is temporarily unavailable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13678) transformExpressions should exclude expression that is not inside QueryPlan.expressions

2016-03-04 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13678:
---

 Summary: transformExpressions should exclude expression that is 
not inside QueryPlan.expressions
 Key: SPARK-13678
 URL: https://issues.apache.org/jira/browse/SPARK-13678
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions

2016-03-04 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13678:

Summary: transformExpressions should only apply on QueryPlan.expressions  
(was: transformExpressions should exclude expression that is not inside 
QueryPlan.expressions)

> transformExpressions should only apply on QueryPlan.expressions
> ---
>
> Key: SPARK-13678
> URL: https://issues.apache.org/jira/browse/SPARK-13678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13678:


Assignee: Apache Spark

> transformExpressions should only apply on QueryPlan.expressions
> ---
>
> Key: SPARK-13678
> URL: https://issues.apache.org/jira/browse/SPARK-13678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179871#comment-15179871
 ] 

Apache Spark commented on SPARK-13678:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/11521

> transformExpressions should only apply on QueryPlan.expressions
> ---
>
> Key: SPARK-13678
> URL: https://issues.apache.org/jira/browse/SPARK-13678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13678) transformExpressions should only apply on QueryPlan.expressions

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13678:


Assignee: (was: Apache Spark)

> transformExpressions should only apply on QueryPlan.expressions
> ---
>
> Key: SPARK-13678
> URL: https://issues.apache.org/jira/browse/SPARK-13678
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13596:


Assignee: Apache Spark

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to be there, like {{README.md}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15179894#comment-15179894
 ] 

Apache Spark commented on SPARK-13596:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11522

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to be there, like {{README.md}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13596:


Assignee: (was: Apache Spark)

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to be there, like {{README.md}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-13673:
--
Assignee: Masayoshi TSUZUKI

> script bin\beeline.cmd pollutes environment variables in Windows.
> -
>
> Key: SPARK-13673
> URL: https://issues.apache.org/jira/browse/SPARK-13673
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.6.0
> Environment: Windows 8.1
>Reporter: Masayoshi TSUZUKI
>Assignee: Masayoshi TSUZUKI
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{bin\beeline.cmd}} pollutes environment variables in Windows.
> A similar problem was reported and fixed in [SPARK-3943], but 
> {{bin\beeline.cmd}} was added later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13673) script bin\beeline.cmd pollutes environment variables in Windows.

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13673.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11516
[https://github.com/apache/spark/pull/11516]

> script bin\beeline.cmd pollutes environment variables in Windows.
> -
>
> Key: SPARK-13673
> URL: https://issues.apache.org/jira/browse/SPARK-13673
> Project: Spark
>  Issue Type: Improvement
>  Components: Windows
>Affects Versions: 1.6.0
> Environment: Windows 8.1
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
> Fix For: 2.0.0
>
>
> {{bin\beeline.cmd}} pollutes environment variables in Windows.
> A similar problem was reported and fixed in [SPARK-3943], but 
> {{bin\beeline.cmd}} was added later.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11515) QuantileDiscretizer should take random seed

2016-03-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11515:
--
Fix Version/s: 1.6.2

> QuantileDiscretizer should take random seed
> ---
>
> Key: SPARK-11515
> URL: https://issues.apache.org/jira/browse/SPARK-11515
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 1.6.2, 2.0.0
>
>
> QuantileDiscretizer takes a random sample to select bins.  It currently does 
> not specify a seed for the XORShiftRandom, but it should take a seed by 
> extending the HasSeed Param.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-03-04 Thread Daniel Jouany (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180011#comment-15180011
 ] 

Daniel Jouany commented on SPARK-10795:
---

Hi there,
If I follow your suggestions, it works.

Our code was like this:

{{
import numpy as np
from pyspark import SparkContext
foo = np.genfromtxt(x)
sc = SparkContext(...)
# compute
}}

*===> It fails*

We have just moved the global variable initialization *after* the context init:

{{
import numpy as np
from pyspark import SparkContext
global foo
sc = SparkContext(...)
foo = np.genfromtxt(x)
# compute
}}
*===> It works perfectly*

Note that you could reproduce this behaviour with something other than a numpy 
call - even though not every statement entails the crash.
The question is: why is this *non-spark* variable init interfering with the 
SparkContext?

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-03-04 Thread Daniel Jouany (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180011#comment-15180011
 ] 

Daniel Jouany edited comment on SPARK-10795 at 3/4/16 3:30 PM:
---

Hi there,
If I follow your suggestions, it works.

Our code was like this:

{code}
import numpy as np
from pyspark import SparkContext
foo = np.genfromtxt(x)
sc = SparkContext(...)
# compute
{code}
*===> It fails*

We have just moved the global variable initialization *after* the context init:

{code}
import numpy as np
from pyspark import SparkContext
global foo
sc = SparkContext(...)
foo = np.genfromtxt(x)
# compute
{code}
*===> It works perfectly*

Note that you could reproduce this behaviour with something other than a numpy 
call - even though not every statement entails the crash.
The question is: why is this *non-spark* variable init interfering with the 
SparkContext?


was (Author: djouany):
Hi there,
If I follow your suggestions, it works.

Our code was like this:

{{
import numpy as np
from pyspark import SparkContext
foo = np.genfromtxt(x)
sc = SparkContext(...)
# compute
}}

*===> It fails*

We have just moved the global variable initialization *after* the context init:

{{
import numpy as np
from pyspark import SparkContext
global foo
sc = SparkContext(...)
foo = np.genfromtxt(x)
# compute
}}
*===> It works perfectly*

Note that you could reproduce this behaviour with something other than a numpy 
call - even though not every statement entails the crash.
The question is: why is this *non-spark* variable init interfering with the 
SparkContext?

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3374) Spark on Yarn remove deprecated configs for 2.0

2016-03-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180027#comment-15180027
 ] 

Thomas Graves commented on SPARK-3374:
--

[~srowen] can you add [~jerrypeng] as a contributor so he can assign himself 
to the JIRA?

> Spark on Yarn remove deprecated configs for 2.0
> ---
>
> Key: SPARK-3374
> URL: https://issues.apache.org/jira/browse/SPARK-3374
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>
> The configs in YARN have become scattered and inconsistent between cluster 
> and client modes, partly to support backwards compatibility. We should try to 
> clean this up, move things to common places, and support configs across both 
> cluster and client modes where we want to make them public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3374) Spark on Yarn remove deprecated configs for 2.0

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180031#comment-15180031
 ] 

Sean Owen commented on SPARK-3374:
--

Yes and I'll make you an admin so you can assign.

> Spark on Yarn remove deprecated configs for 2.0
> ---
>
> Key: SPARK-3374
> URL: https://issues.apache.org/jira/browse/SPARK-3374
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>
> The configs in YARN have become scattered and inconsistent between cluster 
> and client modes, partly to support backwards compatibility. We should try to 
> clean this up, move things to common places, and support configs across both 
> cluster and client modes where we want to make them public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13595) Move docker, extras modules into external

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180037#comment-15180037
 ] 

Apache Spark commented on SPARK-13595:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/11523

> Move docker, extras modules into external
> -
>
> Key: SPARK-13595
> URL: https://issues.apache.org/jira/browse/SPARK-13595
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Examples
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> See also SPARK-13529, SPARK-13548. In the same spirit [~rxin] I'd like to put 
> the {{docker}} and {{docker-integration-test}} modules, and everything under 
> {{extras}}, under {{external}}. This groups these pretty logically related 
> modules and removes three top-level dirs.
> I'll take a look at it and see if there are any complications that this would 
> entail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13595) Move docker, extras modules into external

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13595:


Assignee: (was: Apache Spark)

> Move docker, extras modules into external
> -
>
> Key: SPARK-13595
> URL: https://issues.apache.org/jira/browse/SPARK-13595
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Examples
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> See also SPARK-13529, SPARK-13548. In the same spirit [~rxin] I'd like to put 
> the {{docker}} and {{docker-integration-test}} modules, and everything under 
> {{extras}}, under {{external}}. This groups these pretty logically related 
> modules and removes three top-level dirs.
> I'll take a look at it and see if there are any complications that this would 
> entail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13595) Move docker, extras modules into external

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13595:


Assignee: Apache Spark

> Move docker, extras modules into external
> -
>
> Key: SPARK-13595
> URL: https://issues.apache.org/jira/browse/SPARK-13595
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Examples
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> See also SPARK-13529, SPARK-13548. In the same spirit [~rxin] I'd like to put 
> the {{docker}} and {{docker-integration-test}} modules, and everything under 
> {{extras}}, under {{external}}. This groups these pretty logically related 
> modules and removes three top-level dirs.
> I'll take a look at it and see if there are any complications that this would 
> entail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180047#comment-15180047
 ] 

Apache Spark commented on SPARK-13663:
--

User 'yy2016' has created a pull request for this issue:
https://github.com/apache/spark/pull/11524

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
>
> The JVM memory leak problem reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13663:


Assignee: Apache Spark

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Minor
>
> The JVM memory leak problem reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13663) Upgrade Snappy Java to 1.1.2.1

2016-03-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13663:


Assignee: (was: Apache Spark)

> Upgrade Snappy Java to 1.1.2.1
> --
>
> Key: SPARK-13663
> URL: https://issues.apache.org/jira/browse/SPARK-13663
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
>
> The JVM memory leak problem reported in 
> https://github.com/xerial/snappy-java/issues/131 has been resolved.
> 1.1.2.1 was released on Jan 22nd.
> We should upgrade to this release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13679) Pyspark job fails with Oozie

2016-03-04 Thread Alexandre Linte (JIRA)
Alexandre Linte created SPARK-13679:
---

 Summary: Pyspark job fails with Oozie
 Key: SPARK-13679
 URL: https://issues.apache.org/jira/browse/SPARK-13679
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Submit, YARN
Affects Versions: 1.6.0
 Environment: Hadoop 2.7.2, Spark 1.6.0 on Yarn, Oozie 4.2.0
Cluster secured with Kerberos
Reporter: Alexandre Linte


Hello,

I'm trying to run the pi.py example in a PySpark job with Oozie. Every try I made 
failed for the same reason: key not found: SPARK_HOME.
Note: a Scala job works well in the same environment with Oozie.

The logs on the executors are:
{noformat}
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/mnt/hd4/hadoop/yarn/local/filecache/145/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/mnt/hd2/hadoop/yarn/local/filecache/155/spark-assembly-1.6.0-hadoop2.7.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/opt/application/Hadoop/hadoop-2.7.2/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: 
/mnt/hd7/hadoop/yarn/log/application_1454673025841_13136/container_1454673025841_13136_01_01
 (Is a directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:221)
at java.io.FileOutputStream.(FileOutputStream.java:142)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at 
org.apache.hadoop.yarn.ContainerLogAppender.activateOptions(ContainerLogAppender.java:55)
at 
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at 
org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at 
org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:809)
at 
org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:735)
at 
org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:547)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:483)
at org.apache.log4j.LogManager.(LogManager.java:127)
at 
org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:285)
at 
org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155)
at 
org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:275)
at 
org.apache.hadoop.service.AbstractService.(AbstractService.java:43)
Using properties file: null
Parsed arguments:
  master  yarn-master
  deployMode  cluster
  executorMemory  null
  executorCores   null
  totalExecutorCores  null
  propertiesFile  null
  driverMemorynull
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   null
  primaryResource 
hdfs://hadoopsandbox/User/toto/WORK/Oozie/pyspark/lib/pi.py
  namePysparkpi example
  childArgs   [100]
  jarsnull
  packagesnull
  packagesExclusions  null
  repositoriesnull
  verbose true

Spark properties used, including those specified through
 --conf and those from the properties file null:
  spark.executorEnv.SPARK_HOME -> /opt/application/Spark/current
  spark.executorEnv.PYTHONPATH -> /opt/application/Spark/current/python
  spark.yarn.appMasterEnv.SPARK_HOME -> /opt/application/Spark/current


Main class:
org.apache.spark.deploy.yarn.Client
Arguments:
--name
Pysparkpi example
--primary-py-file
hdfs://hadoopsandbox/User/toto/WORK/Oozie/pyspark/lib/pi.py
--class
org.apache.spark.deploy.PythonRunner
--arg
100
System properties:
spark.executorEnv.SPARK_HOME -> /opt/application/Spark/curre

[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180072#comment-15180072
 ] 

Nicholas Chammas commented on SPARK-13596:
--

Looks like {{tox.ini}} is only used by {{pep8}}, so if you move it into 
{{dev/}}, where the Python lint checks run from, that should work.

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to be there, like {{README.md}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)
Yael Aharon created SPARK-13680:
---

 Summary: Java UDAF with more than one intermediate argument 
returns wrong results
 Key: SPARK-13680
 URL: https://issues.apache.org/jira/browse/SPARK-13680
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: CDH 5.5.2
Reporter: Yael Aharon


I am trying to incorporate the Java UDAF from 
https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
 into an SQL query. 
I registered the UDAF like this:
 sqlContext.udf().register("myavg", new MyDoubleAvg());

My SQL query is:
SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS `avg_stdevi`, 
MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS `max_ci`, MAX(cd) 
AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS `max_stdevi`, 
MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS `min_ci`, MIN(cd) 
AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS `min_stdevi`,SUM(seqi) 
AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS `sum_ci`, SUM(cd) AS 
`sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS `sum_stdevi`, myavg(seqd) 
as `myavg_seqd`,  AVG(zero) AS `avg_zero`, AVG(nulli) AS 
`avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, SUM(nulli) AS 
`sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, MAX(nulli) AS 
`max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, count(nulli) AS 
`count_nulli` FROM mytable

As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
When I remove the call to the UDAF, the results are correct.
I was able to work around the issue by modifying the bufferSchema of the UDAF to 
use an array, along with the corresponding update and merge methods.
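For reference, here is a minimal Scala sketch of a UDAF whose {{bufferSchema}} carries 
more than one intermediate field (a plain average with separate sum and count fields), 
written against the Spark 1.5 {{UserDefinedAggregateFunction}} API. The class name 
{{TwoFieldAvg}} is made up for illustration; this is not the MyDoubleAvg code linked 
above.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Average with two intermediate buffer fields: a running sum and a running count.
class TwoFieldAvg extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0   // sum
    buffer(1) = 0L    // count
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
}
{code}

It would be registered the same way as above, e.g. 
{{sqlContext.udf.register("twofieldavg", new TwoFieldAvg)}} from Scala.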



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yael Aharon updated SPARK-13680:

Attachment: data.csv

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, 
> count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13676.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11519
[https://github.com/apache/spark/pull/11519]

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> The default value of regularization parameter for `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180096#comment-15180096
 ] 

Yael Aharon commented on SPARK-13680:
-

I attached data.csv, which contains the data used for this test.

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, 
> count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13676:
--
Target Version/s: 2.0.0

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> The default value of regularization parameter for `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13676) Fix mismatched default values for regParam in LogisticRegression

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13676:
--
Assignee: Dongjoon Hyun

> Fix mismatched default values for regParam in LogisticRegression
> 
>
> Key: SPARK-13676
> URL: https://issues.apache.org/jira/browse/SPARK-13676
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> The default value of regularization parameter for `LogisticRegression` 
> algorithm is different in Scala and Python. We should provide the same value.
> {code:title=Scala|borderStyle=solid}
> scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
> res0: Double = 0.0
> {code}
> {code:title=Python|borderStyle=solid}
> >>> from pyspark.ml.classification import LogisticRegression
> >>> LogisticRegression().getRegParam()
> 0.1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13036) PySpark ml.feature support export/import

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13036.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11203
[https://github.com/apache/spark/pull/11203]

> PySpark ml.feature support export/import
> 
>
> Key: SPARK-13036
> URL: https://issues.apache.org/jira/browse/SPARK-13036
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/feature.py. Please refer to the implementation 
> in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13318:
--
Target Version/s: 2.0.0

> Model export/import for spark.ml: ElementwiseProduct
> 
>
> Key: SPARK-13318
> URL: https://issues.apache.org/jira/browse/SPARK-13318
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add save/load to ElementwiseProduct



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13319) Pyspark VectorSlicer, StopWordsRemvoer should have setDefault

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13319.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11203
[https://github.com/apache/spark/pull/11203]

> Pyspark VectorSlicer, StopWordsRemvoer should have setDefault
> -
>
> Key: SPARK-13319
> URL: https://issues.apache.org/jira/browse/SPARK-13319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> PySpark VectorSlicer should have setDefault; otherwise it will cause an error 
> when calling getNames or getIndices.
> StopWordsRemover needs to set a default value for "caseSensitive".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13319) Pyspark VectorSlicer, StopWordsRemvoer should have setDefault

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13319:
--
Target Version/s: 2.0.0

> Pyspark VectorSlicer, StopWordsRemvoer should have setDefault
> -
>
> Key: SPARK-13319
> URL: https://issues.apache.org/jira/browse/SPARK-13319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> PySpark VectorSlicer should have setDefault; otherwise it will cause an error 
> when calling getNames or getIndices.
> StopWordsRemover needs to set a default value for "caseSensitive".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13318.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11203
[https://github.com/apache/spark/pull/11203]

> Model export/import for spark.ml: ElementwiseProduct
> 
>
> Key: SPARK-13318
> URL: https://issues.apache.org/jira/browse/SPARK-13318
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add save/load to ElementwiseProduct



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13319) Pyspark VectorSlicer, StopWordsRemvoer should have setDefault

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13319:
--
Assignee: Xusen Yin

> Pyspark VectorSlicer, StopWordsRemvoer should have setDefault
> -
>
> Key: SPARK-13319
> URL: https://issues.apache.org/jira/browse/SPARK-13319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> PySpark VectorSlicer should have setDefault; otherwise it will cause an error 
> when calling getNames or getIndices.
> StopWordsRemover needs to set a default value for "caseSensitive".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13318) Model export/import for spark.ml: ElementwiseProduct

2016-03-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-13318:
--
Assignee: Xusen Yin

> Model export/import for spark.ml: ElementwiseProduct
> 
>
> Key: SPARK-13318
> URL: https://issues.apache.org/jira/browse/SPARK-13318
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add save/load to ElementwiseProduct



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13494) Cannot sort on a column which is of type "array"

2016-03-04 Thread Yael Aharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172115#comment-15172115
 ] 

Yael Aharon edited comment on SPARK-13494 at 3/4/16 4:35 PM:
-

I am using Spark 1.5 from the Cloudera distribution CDH 5.5.2. Do you think this 
has been fixed since then?

The Hive schema of the column in question is  array


was (Author: yael):
I am using Spark 5.2 from Cloudera distribution CDH 5.2 . Do you think this was 
fixed since?

The Hive schema of the column in question is  array

> Cannot sort on a column which is of type "array"
> 
>
> Key: SPARK-13494
> URL: https://issues.apache.org/jira/browse/SPARK-13494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yael Aharon
>
> Executing the following SQL results in an error if columnName refers to a 
> column of type array
> SELECT * FROM myTable ORDER BY columnName ASC LIMIT 50
> The error is 
> org.apache.spark.sql.AnalysisException: cannot resolve 'columnName ASC' due 
> to data type mismatch: cannot sort data type array



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yael Aharon updated SPARK-13680:

Description: 
I am trying to incorporate the Java UDAF from 
https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
 into an SQL query. 
I registered the UDAF like this:
 sqlContext.udf().register("myavg", new MyDoubleAvg());

My SQL query is:
SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS `avg_stdevi`, 
MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS `max_ci`, MAX(cd) 
AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS `max_stdevi`, 
MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS `min_ci`, MIN(cd) 
AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS `min_stdevi`,SUM(seqi) 
AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS `sum_ci`, SUM(cd) AS 
`sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS `sum_stdevi`, myavg(seqd) 
as `myavg_seqd`,  AVG(zero) AS `avg_zero`, AVG(nulli) AS 
`avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, SUM(nulli) AS 
`sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, MAX(nulli) AS 
`max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS `count_all`, count(nulli) 
AS `count_nulli` FROM mytable

As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
When I remove the call to the UDAF, the results are correct.
I was able to work around the issue by modifying the bufferSchema of the UDAF to 
use an array, along with the corresponding update and merge methods.

  was:
I am trying to incorporate the Java UDAF from 
https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
 into an SQL query. 
I registered the UDAF like this:
 sqlContext.udf().register("myavg", new MyDoubleAvg());

My SQL query is:
SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS `avg_stdevi`, 
MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS `max_ci`, MAX(cd) 
AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS `max_stdevi`, 
MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS `min_ci`, MIN(cd) 
AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS `min_stdevi`,SUM(seqi) 
AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS `sum_ci`, SUM(cd) AS 
`sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS `sum_stdevi`, myavg(seqd) 
as `myavg_seqd`,  AVG(zero) AS `avg_zero`, AVG(nulli) AS 
`avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, SUM(nulli) AS 
`sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, MAX(nulli) AS 
`max_nulli`,MAX(nulld) AS `max_nulld`,count(*) AS `count_all`, count(nulli) AS 
`count_nulli` FROM mytable

As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
When I remove the call to the UDAF, the results are correct.
I was able to work around the issue by modifying bufferSchema of the UDAF to use 
an array and the corresponding update and merge methods. 


> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `cou

[jira] [Comment Edited] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180120#comment-15180120
 ] 

Yael Aharon edited comment on SPARK-13680 at 3/4/16 4:42 PM:
-

I found this in the spark executor logs when running the MyDoubleAVG UDAF. 
Execution continued in spite of this exception:

java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getLong(rows.scala:247)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply772_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:174)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:74)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


was (Author: yael):
I found this in the spark executor logs when running the MyDoubleAVG UDAF. 
Execution continued in spite of this exception:
java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getLong(rows.scala:247)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply772_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:174)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:74)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.

[jira] [Commented] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180120#comment-15180120
 ] 

Yael Aharon commented on SPARK-13680:
-

I found this in the spark executor logs when running the MyDoubleAVG UDAF. 
Execution continued in spite of this exception:
java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getLong(rows.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getLong(rows.scala:247)
at 
org.apache.spark.sql.catalyst.expressions.JoinedRow.getLong(JoinedRow.scala:85)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply772_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:174)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$11.apply(AggregationIterator.scala:171)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.processCurrentSortedGroup(SortBasedAggregationIterator.scala:100)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:139)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:74)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216
 ] 

Łukasz Gieroń commented on SPARK-13230:
---

The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).
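
For anyone who wants to check this outside Spark, here is a minimal sketch (an 
assumption of mine, not Łukasz's code, assuming plain Java serialization and Scala 
2.11.7) that serializes a one-element HashMap, deserializes it, and calls merged on 
the copy; with the bug described above, the merge function is handed a null tuple 
and should hit the same NullPointerException:

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.immutable.HashMap

object MergedRepro {
  def main(args: Array[String]): Unit = {
    val original = HashMap("A" -> 1L)

    // Java-serialize and deserialize the map, as happens when Spark ships results around.
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(original)
    oos.close()
    val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[HashMap[String, Long]]

    // If the internal kv tuple was lost during deserialization, the merge
    // function below receives a null tuple and throws NullPointerException.
    val merged = copy.merged(HashMap("A" -> 2L)) { case ((k, v1), (_, v2)) => (k, v1 + v2) }
    println(merged)
  }
}
{code}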

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017)
> at 
> org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1

[jira] [Comment Edited] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216
 ] 

Łukasz Gieroń edited comment on SPARK-13230 at 3/4/16 5:34 PM:
---

The issue here is a bug in the Scala library, in the deserialization of 'HashMap1' 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).


was (Author: lgieron):
The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:101

[jira] [Comment Edited] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216
 ] 

Łukasz Gieroń edited comment on SPARK-13230 at 3/4/16 5:35 PM:
---

The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).


was (Author: lgieron):
The issue here is a bug in the Scala library, in the deserialization of 'HashMap1' 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:101

[jira] [Comment Edited] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180216#comment-15180216
 ] 

Łukasz Gieroń edited comment on SPARK-13230 at 3/4/16 5:36 PM:
---

The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).

PS. Not sure why Jira doesn't recognize my backticks markdown.


was (Author: lgieron):
The issue here is a bug in the Scala library, in the deserialization of `HashMap1` 
objects. When they get deserialized, the internal `kv` field does not get 
deserialized (it is left `null`), which causes a `NullPointerException` in 
`merged`. I've fixed this in the Scala library, and it fixes the issue.
I'm going to open a bug against the Scala library and submit a pull request for it, 
and link that ticket here (if it's possible to link between Jiras).

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache

[jira] [Commented] (SPARK-13230) HashMap.merged not working properly with Spark

2016-03-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180234#comment-15180234
 ] 

Sean Owen commented on SPARK-13230:
---

Thanks, that's a great analysis. It sounds like we might need to close this as 
a Scala problem, and offer a workaround. For example, it's obviously possible 
to write a little function that accomplishes the same thing, and which I hope 
doesn't depend on serializing the same internal representation.

(PS JIRA does not use markdown. Use pairs of curly braces to {{format as code}}.)
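
A sketch of such a workaround, assuming the mergeFn signature from the ticket 
description: fold one map into the other with updated/getOrElse so nothing depends 
on HashMap's internal node state.

{code}
import scala.collection.immutable.HashMap

// Drop-in replacement for the ticket's mergeFn that avoids HashMap.merged.
def mergeFn: (HashMap[String, Long], HashMap[String, Long]) => HashMap[String, Long] = {
  case (m1, m2) =>
    m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
{code}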

> HashMap.merged not working properly with Spark
> --
>
> Key: SPARK-13230
> URL: https://issues.apache.org/jira/browse/SPARK-13230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04.3, Scala 2.11.7, Spark 1.6.0
>Reporter: Alin Treznai
>
> Using HashMap.merged with Spark fails with NullPointerException.
> {noformat}
> import org.apache.spark.{SparkConf, SparkContext}
> import scala.collection.immutable.HashMap
> object MergeTest {
>   def mergeFn:(HashMap[String, Long], HashMap[String, Long]) => 
> HashMap[String, Long] = {
> case (m1, m2) => m1.merged(m2){ case (x,y) => (x._1, x._2 + y._2) }
>   }
>   def main(args: Array[String]) = {
> val input = Seq(HashMap("A" -> 1L), HashMap("A" -> 2L, "B" -> 
> 3L),HashMap("A" -> 2L, "C" -> 4L))
> val conf = new SparkConf().setAppName("MergeTest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val result = sc.parallelize(input).reduce(mergeFn)
> println(s"Result=$result")
> sc.stop()
>   }
> }
> {noformat}
> Error message:
> org.apache.spark.SparkDriverExecutionException: Execution error
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1169)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at MergeTest$.main(MergeTest.scala:21)
> at MergeTest.main(MergeTest.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> Caused by: java.lang.NullPointerException
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at 
> MergeTest$$anonfun$mergeFn$1$$anonfun$apply$1.apply(MergeTest.scala:12)
> at scala.collection.immutable.HashMap$$anon$2.apply(HashMap.scala:148)
> at 
> scala.collection.immutable.HashMap$HashMap1.updated0(HashMap.scala:200)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.updated0(HashMap.scala:322)
> at 
> scala.collection.immutable.HashMap$HashTrieMap.merge0(HashMap.scala:463)
> at scala.collection.immutable.HashMap.merged(HashMap.scala:117)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:12)
> at MergeTest$$anonfun$mergeFn$1.apply(MergeTest.scala:11)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1020)
> at 
> org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1017)
> at 
> org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:56)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1165)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(

[jira] [Commented] (SPARK-13048) EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel

2016-03-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180241#comment-15180241
 ] 

Joseph K. Bradley commented on SPARK-13048:
---

I'd say the best fix would be to add an option to LDA to not delete the last 
checkpoint.  I'd prefer to expose this as a Param in the spark.ml API, but it 
could be added to the spark.mllib API as well if necessary.

[~holdenk]  I agree we need to figure out how to handle/control caching and 
checkpointing within Pipelines, but that will have to wait for after 2.0.

[~jvstein]  We try to minimize the public API.  Although I agree with you about 
opening up APIs in principle, it has proven dangerous in practice.  Even when 
we mark things DeveloperApi, many users still use those APIs, making it 
difficult to change them in the future.
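
For concreteness, a rough sketch of what such a Param could look like in spark.ml. 
This is purely illustrative; the keepLastCheckpoint name, the default, and the 
trait are assumptions, not an agreed API.

{code}
import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical shared param: when true, the last checkpoint is kept so that a
// DistributedLDAModel can still read it; the user cleans the directory later.
trait HasKeepLastCheckpoint extends Params {
  final val keepLastCheckpoint: BooleanParam = new BooleanParam(this, "keepLastCheckpoint",
    "whether to keep the last checkpoint instead of deleting it when training finishes")

  setDefault(keepLastCheckpoint -> true)

  final def getKeepLastCheckpoint: Boolean = $(keepLastCheckpoint)
}
{code}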

> EMLDAOptimizer deletes dependent checkpoint of DistributedLDAModel
> --
>
> Key: SPARK-13048
> URL: https://issues.apache.org/jira/browse/SPARK-13048
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
> Environment: Standalone Spark cluster
>Reporter: Jeff Stein
>
> In EMLDAOptimizer, all checkpoints are deleted before returning the 
> DistributedLDAModel.
> The most recent checkpoint is still necessary for operations on the 
> DistributedLDAModel under a couple scenarios:
> - The graph doesn't fit in memory on the worker nodes (e.g. very large data 
> set).
> - Late worker failures that require reading the now-dependent checkpoint.
> I ran into this problem running a 10M record LDA model in a memory starved 
> environment. The model consistently failed in either the {{collect at 
> LDAModel.scala:528}} stage (when converting to a LocalLDAModel) or in the 
> {{reduce at LDAModel.scala:563}} stage (when calling "describeTopics" on the 
> model). In both cases, a FileNotFoundException is thrown attempting to access 
> a checkpoint file.
> I'm not sure what the correct fix is here; it might involve a class signature 
> change. An alternative simple fix is to leave the last checkpoint around and 
> expect the user to clean the checkpoint directory themselves.
> {noformat}
> java.io.FileNotFoundException: File does not exist: 
> /hdfs/path/to/checkpoints/c8bd2b4e-27dd-47b3-84ec-3ff0bac04587/rdd-635/part-00071
> {noformat}
> Relevant code is included below.
> LDAOptimizer.scala:
> {noformat}
>   override private[clustering] def getLDAModel(iterationTimes: 
> Array[Double]): LDAModel = {
> require(graph != null, "graph is null, EMLDAOptimizer not initialized.")
> this.graphCheckpointer.deleteAllCheckpoints()
> // The constructor's default arguments assume gammaShape = 100 to ensure 
> equivalence in
> // LDAModel.toLocal conversion
> new DistributedLDAModel(this.graph, this.globalTopicTotals, this.k, 
> this.vocabSize,
>   Vectors.dense(Array.fill(this.k)(this.docConcentration)), 
> this.topicConcentration,
>   iterationTimes)
>   }
> {noformat}
> PeriodicCheckpointer.scala
> {noformat}
>   /**
>* Call this at the end to delete any remaining checkpoint files.
>*/
>   def deleteAllCheckpoints(): Unit = {
> while (checkpointQueue.nonEmpty) {
>   removeCheckpointFile()
> }
>   }
>   /**
>* Dequeue the oldest checkpointed Dataset, and remove its checkpoint files.
>* This prints a warning but does not fail if the files cannot be removed.
>*/
>   private def removeCheckpointFile(): Unit = {
> val old = checkpointQueue.dequeue()
> // Since the old checkpoint is not deleted by Spark, we manually delete 
> it.
> val fs = FileSystem.get(sc.hadoopConfiguration)
> getCheckpointFiles(old).foreach { checkpointFile =>
>   try {
> fs.delete(new Path(checkpointFile), true)
>   } catch {
> case e: Exception =>
>   logWarning("PeriodicCheckpointer could not remove old checkpoint 
> file: " +
> checkpointFile)
>   }
> }
>   }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint

2016-03-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180243#comment-15180243
 ] 

Joseph K. Bradley commented on SPARK-13434:
---

There are a few options here:
* Temp fix: Reduce the number of executors, as you suggested.
* Long-term for this RF implementation: Implement local training for deep 
trees.  Spilling the current tree to disk would help, but I'd guess that local 
training would have a bigger impact.
* Long-term fix via a separate RF implementation: I've been working for a long 
time on a column-partitioned implementation which will be better for tasks like 
yours with many features & deep trees.  It's making progress but not yet ready 
to merge into Spark.

> Reduce Spark RandomForest memory footprint
> --
>
> Key: SPARK-13434
> URL: https://issues.apache.org/jira/browse/SPARK-13434
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Ewan Higgs
>  Labels: decisiontree, mllib, randomforest
> Attachments: heap-usage.log, rf-heap-usage.png
>
>
> The RandomForest implementation can easily run out of memory on moderate 
> datasets. This was raised in the a user's benchmarking game on github 
> (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there 
> was a tracking issue, but I couldn't fine one.
> Using Spark 1.6, a user of mine is running into problems running the 
> RandomForest training on largish datasets on machines with 64G memory and the 
> following in {{spark-defaults.conf}}:
> {code}
> spark.executor.cores 2
> spark.executor.instances 199
> spark.executor.memory 10240M
> {code}
> I reproduced the excessive memory use from the benchmark example (using an 
> input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell 
> --driver-memory 30G --executor-memory 30G}} and have a heap profile from a 
> single machine by running {{jmap -histo:live }}. I took a sample 
> every 5 seconds and at the peak it looks like this:
> {code}
>  num #instances #bytes  class name
> --
>1:   5428073 8458773496  [D
>2:  12293653 4124641992  [I
>3:  32508964 1820501984  org.apache.spark.mllib.tree.model.Node
>4:  53068426 1698189632  org.apache.spark.mllib.tree.model.Predict
>5:  72853787 1165660592  scala.Some
>6:  16263408  910750848  
> org.apache.spark.mllib.tree.model.InformationGainStats
>7: 72969  390492744  [B
>8:   3327008  133080320  
> org.apache.spark.mllib.tree.impl.DTStatsAggregator
>9:   3754500  120144000  
> scala.collection.immutable.HashMap$HashMap1
>   10:   3318349  106187168  org.apache.spark.mllib.tree.model.Split
>   11:   3534946   84838704  
> org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
>   12:   3764745   60235920  java.lang.Integer
>   13:   3327008   53232128  
> org.apache.spark.mllib.tree.impurity.EntropyAggregator
>   14:380804   45361144  [C
>   15:268887   34877128  
>   16:268887   34431568  
>   17:908377   34042760  [Lscala.collection.immutable.HashMap;
>   18:   110   2640  
> org.apache.spark.mllib.regression.LabeledPoint
>   19:   110   2640  org.apache.spark.mllib.linalg.SparseVector
>   20: 20206   25979864  
>   21:   100   2400  org.apache.spark.mllib.tree.impl.TreePoint
>   22:   100   2400  
> org.apache.spark.mllib.tree.impl.BaggedPoint
>   23:908332   21799968  
> scala.collection.immutable.HashMap$HashTrieMap
>   24: 20206   20158864  
>   25: 17023   14380352  
>   26:16   13308288  
> [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
>   27:445797   10699128  scala.Tuple2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13680) Java UDAF with more than one intermediate argument returns wrong results

2016-03-04 Thread Yael Aharon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yael Aharon updated SPARK-13680:

Attachment: setup.hql

> Java UDAF with more than one intermediate argument returns wrong results
> 
>
> Key: SPARK-13680
> URL: https://issues.apache.org/jira/browse/SPARK-13680
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: CDH 5.5.2
>Reporter: Yael Aharon
> Attachments: data.csv, setup.hql
>
>
> I am trying to incorporate the Java UDAF from 
> https://github.com/apache/spark/blob/master/sql/hive/src/test/java/org/apache/spark/sql/hive/aggregate/MyDoubleAvg.java
>  into an SQL query. 
> I registered the UDAF like this:
>  sqlContext.udf().register("myavg", new MyDoubleAvg());
> My SQL query is:
> SELECT AVG(seqi) AS `avg_seqi`, AVG(seqd) AS `avg_seqd`, AVG(ci) AS `avg_ci`, 
> AVG(cd) AS `avg_cd`, AVG(stdevd) AS `avg_stdevd`, AVG(stdevi) AS 
> `avg_stdevi`, MAX(seqi) AS `max_seqi`, MAX(seqd) AS `max_seqd`, MAX(ci) AS 
> `max_ci`, MAX(cd) AS `max_cd`, MAX(stdevd) AS `max_stdevd`, MAX(stdevi) AS 
> `max_stdevi`, MIN(seqi) AS `min_seqi`, MIN(seqd) AS `min_seqd`, MIN(ci) AS 
> `min_ci`, MIN(cd) AS `min_cd`, MIN(stdevd) AS `min_stdevd`, MIN(stdevi) AS 
> `min_stdevi`,SUM(seqi) AS `sum_seqi`, SUM(seqd) AS `sum_seqd`, SUM(ci) AS 
> `sum_ci`, SUM(cd) AS `sum_cd`, SUM(stdevd) AS `sum_stdevd`, SUM(stdevi) AS 
> `sum_stdevi`, myavg(seqd) as `myavg_seqd`,  AVG(zero) AS `avg_zero`, 
> AVG(nulli) AS `avg_nulli`,AVG(nulld) AS `avg_nulld`, SUM(zero) AS `sum_zero`, 
> SUM(nulli) AS `sum_nulli`,SUM(nulld) AS `sum_nulld`,MAX(zero) AS `max_zero`, 
> MAX(nulli) AS `max_nulli`,MAX(nulld) AS `max_nulld`,count( * ) AS 
> `count_all`, count(nulli) AS `count_nulli` FROM mytable
> As soon as I add the UDAF myavg to the SQL, all the results become incorrect. 
> When I remove the call to the UDAF, the results are correct.
> I was able to go around the issue by modifying bufferSchema of the UDAF to 
> use an array and the corresponding update and merge methods. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13494) Cannot sort on a column which is of type "array"

2016-03-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180289#comment-15180289
 ] 

Xiao Li commented on SPARK-13494:
-

Can you try one of the newer versions? A lot of issues have been fixed 
in each release.

> Cannot sort on a column which is of type "array"
> 
>
> Key: SPARK-13494
> URL: https://issues.apache.org/jira/browse/SPARK-13494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yael Aharon
>
> Executing the following SQL results in an error if columnName refers to a 
> column of type array
> SELECT * FROM myTable ORDER BY columnName ASC LIMIT 50
> The error is 
> org.apache.spark.sql.AnalysisException: cannot resolve 'columnName ASC' due 
> to data type mismatch: cannot sort data type array



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13633) Move parser classes to o.a.s.sql.catalyst.parser package

2016-03-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-13633.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Move parser classes to o.a.s.sql.catalyst.parser package
> 
>
> Key: SPARK-13633
> URL: https://issues.apache.org/jira/browse/SPARK-13633
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13681) Reimplement CommitFailureTestRelationSuite

2016-03-04 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13681:


 Summary: Reimplement CommitFailureTestRelationSuite
 Key: SPARK-13681
 URL: https://issues.apache.org/jira/browse/SPARK-13681
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13681) Reimplement CommitFailureTestRelationSuite

2016-03-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13681:
-
Description: This test case got broken by 
[#11509|https://github.com/apache/spark/pull/11509].  We should reimplement it 
as a format.

> Reimplement CommitFailureTestRelationSuite
> --
>
> Key: SPARK-13681
> URL: https://issues.apache.org/jira/browse/SPARK-13681
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
>
> This test case got broken by 
> [#11509|https://github.com/apache/spark/pull/11509].  We should reimplement 
> it as a format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13682) Finalize the public API for FileFormat

2016-03-04 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-13682:


 Summary: Finalize the public API for FileFormat
 Key: SPARK-13682
 URL: https://issues.apache.org/jira/browse/SPARK-13682
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust


The current file format interface needs to be cleaned up before it's acceptable 
for public consumption:
 - Have a version that takes Row and does a conversion, hide the internal API.
 - Remove bucketing
 - Remove RDD and the broadcastedConf
 - Remove SQLContext (maybe include SparkSession?)
 - Pass a better conf object



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180427#comment-15180427
 ] 

Marcelo Vanzin commented on SPARK-13670:


After some fun playing with arcane bash syntax, here's something that worked 
for me:

{code}
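# Read the NUL-separated arguments emitted by launcher.Main into an array and exec
# them; with pipefail set, a launcher.Main failure becomes the script's exit status.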
run_command() {
  CMD=()
  while IFS='' read -d '' -r ARG; do
echo "line: $ARG"
CMD+=("$ARG")
  done
  if [ ${#CMD[@]} -gt 0 ]; then
exec "${CMD[@]}"
  fi
}

set -o pipefail
"$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@" | 
run_command
{code}

Example:

{noformat}
$ ./bin/spark-shell 
NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
ahead of assembly.
Exception in thread "main" java.lang.IllegalArgumentException: Testing, 
testing, testing...
at org.apache.spark.launcher.Main.main(Main.java:93)

$ echo $?
1
{noformat}



> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180444#comment-15180444
 ] 

Marcelo Vanzin commented on SPARK-13670:


Note that this will probably leave a bash process running somewhere alongside the 
Spark JVM, so it would probably need tweaks to avoid that... bash is fun.


> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13670) spark-class doesn't bubble up error from launcher command

2016-03-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180455#comment-15180455
 ] 

Marcelo Vanzin commented on SPARK-13670:


Actually scrap that, it breaks things when the spark-shell actually runs... 
back to the drawing board.

> spark-class doesn't bubble up error from launcher command
> -
>
> Key: SPARK-13670
> URL: https://issues.apache.org/jira/browse/SPARK-13670
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
>Reporter: Mark Grover
>Priority: Minor
>
> There's a particular snippet in spark-class 
> [here|https://github.com/apache/spark/blob/master/bin/spark-class#L86] that 
> runs the spark-launcher code in a subshell.
> {code}
> # The launcher library will print arguments separated by a NULL character, to 
> allow arguments with
> # characters that would be otherwise interpreted by the shell. Read that in a 
> while loop, populating
> # an array that will be used to exec the final command.
> CMD=()
> while IFS= read -d '' -r ARG; do
>   CMD+=("$ARG")
> done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main 
> "$@")
> {code}
> The problem is that if the launcher Main fails, this code still 
> returns success and continues, even though the top level script is marked 
> {{set -e}}. This is because the launcher.Main is run within a subshell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


