[jira] [Created] (SPARK-11524) Support SparkR with Mesos cluster

2015-11-04 Thread Sun Rui (JIRA)
Sun Rui created SPARK-11524:
---

 Summary: Support SparkR with Mesos cluster
 Key: SPARK-11524
 URL: https://issues.apache.org/jira/browse/SPARK-11524
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Sun Rui






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11507) Error thrown when using BlockMatrix.add

2015-11-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991261#comment-14991261
 ] 

yuhao yang commented on SPARK-11507:


Looking into it. Should be a bug. Breeze may remove the extra end after 
addition. 

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.5.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.
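For readers without access to the external issue link, here is a minimal sketch of the API involved (an editor's illustration, not the reporter's reproduction; it assumes an active SparkContext {{sc}} and arbitrary sparse entries):

{code}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

// Build two sparse 10x10 matrices from coordinate entries and convert them to 3x3 blocks.
val entriesA = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(5, 5, 2.0)))
val entriesB = sc.parallelize(Seq(MatrixEntry(0, 0, 3.0), MatrixEntry(9, 9, 4.0)))
val a: BlockMatrix = new CoordinateMatrix(entriesA, 10, 10).toBlockMatrix(3, 3).cache()
val b: BlockMatrix = new CoordinateMatrix(entriesB, 10, 10).toBlockMatrix(3, 3).cache()

// The reported failure happens inside add() for certain inputs (the error mentions colPtr).
val sum = a.add(b)
println(sum.toLocalMatrix())
{code}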



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11507) Error thrown when using BlockMatrix.add

2015-11-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991261#comment-14991261
 ] 

yuhao yang edited comment on SPARK-11507 at 11/5/15 7:21 AM:
-

Looking into it. Should be a bug. Breeze may remove the extra end in colPtr 
after addition. 


was (Author: yuhaoyan):
Looking into it. Should be a bug. Breeze may remove the extra end after 
addition. 

> Error thrown when using BlockMatrix.add
> ---
>
> Key: SPARK-11507
> URL: https://issues.apache.org/jira/browse/SPARK-11507
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1, 1.5.0
> Environment: Mac/local machine, EC2
> Scala
>Reporter: Kareem Alhazred
>Priority: Minor
>
> In certain situations when adding two block matrices, I get an error 
> regarding colPtr and the operation fails.  External issue URL includes full 
> error and code for reproducing the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11475) DataFrame API saveAsTable() does not work well for HDFS HA

2015-11-04 Thread zhangxiongfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991258#comment-14991258
 ] 

zhangxiongfei commented on SPARK-11475:
---

Hi [~rekhajoshm]
Thanks for pointing out my incorrect Hive metastore configuration.
This is not a Spark issue.
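For context, the description below shows a warehouse path without an HA-aware scheme. A hedged example of what the corrected hive-site.xml property might look like, assuming {{bitautodmp}} is the configured HDFS nameservice (as in the working {{saveAsParquetFile}} path):

{code}
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>hdfs://bitautodmp/apps/hive/warehouse</value>
</property>
{code}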

> DataFrame API saveAsTable() does not work well for HDFS HA
> --
>
> Key: SPARK-11475
> URL: https://issues.apache.org/jira/browse/SPARK-11475
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: Hadoop 2.4 & Spark 1.5.1
>Reporter: zhangxiongfei
> Attachments: dataFrame_saveAsTable.txt, hdfs-site.xml, hive-site.xml
>
>
> I was trying to save a DataFrame to Hive using the following code:
> {quote}
> sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsTable("dataframeTable")
> {quote}
> But got below exception:
> {quote}
> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
> details
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category READ is not supported in state standby
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:87)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1610)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1193)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3516)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:785)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(
> {quote}
> *My Hive configuration is*:
> {quote}
> <property>
>   <name>hive.metastore.warehouse.dir</name>
>   <value>/apps/hive/warehouse</value>
> </property>
> {quote}
> It seems that HDFS HA is not configured there, so I then tried the code below:
> {quote}
> sqlContext.range(1L,1000L,2L,2).coalesce(1).saveAsParquetFile("hdfs://bitautodmp/apps/hive/warehouse/dataframeTable")
> {quote}
> I could verify that the *saveAsParquetFile* API worked well with the following 
> commands:
> {quote}
> *hadoop fs -ls /apps/hive/warehouse/dataframeTable*
> Found 4 items
> -rw-r--r--   3 zhangxf hdfs  0 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/_SUCCESS*
> -rw-r--r--   3 zhangxf hdfs199 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/_common_metadata*
> -rw-r--r--   3 zhangxf hdfs325 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/_metadata*
> -rw-r--r--   3 zhangxf hdfs   1098 2015-11-03 17:57 
> */apps/hive/warehouse/dataframeTable/part-r-0-a05a9bf3-b2a6-40e5-b180-818efb2a0f54.gz.parquet*
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991239#comment-14991239
 ] 

Apache Spark commented on SPARK-4557:
-

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/9488

> Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a 
> Function<..., Void>
> ---
>
> Key: SPARK-4557
> URL: https://issues.apache.org/jira/browse/SPARK-4557
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Alexis Seigneurin
>Priority: Minor
>  Labels: starter
>
> In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
> have to write:
> {code:java}
> .foreachRDD(items -> {
> ...;
> return null;
> });
> {code}
> Instead of:
> {code:java}
> .foreachRDD(items -> ...);
> {code}
> This is because the foreachRDD method accepts a Function<..., Void> instead of 
> a VoidFunction<...>. It would make sense to change it to a VoidFunction since, 
> elsewhere in Spark's API, the foreach method already accepts a VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4557:
---

Assignee: (was: Apache Spark)

> Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a 
> Function<..., Void>
> ---
>
> Key: SPARK-4557
> URL: https://issues.apache.org/jira/browse/SPARK-4557
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Alexis Seigneurin
>Priority: Minor
>  Labels: starter
>
> In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
> have to write:
> {code:java}
> .foreachRDD(items -> {
> ...;
> return null;
> });
> {code}
> Instead of:
> {code:java}
> .foreachRDD(items -> ...);
> {code}
> This is because the foreachRDD method accepts a Function<..., Void> instead of 
> a VoidFunction<...>. It would make sense to change it to a VoidFunction since, 
> elsewhere in Spark's API, the foreach method already accepts a VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4557:
---

Assignee: Apache Spark

> Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a 
> Function<..., Void>
> ---
>
> Key: SPARK-4557
> URL: https://issues.apache.org/jira/browse/SPARK-4557
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.1.0
>Reporter: Alexis Seigneurin
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
> have to write:
> {code:java}
> .foreachRDD(items -> {
> ...;
> return null;
> });
> {code}
> Instead of:
> {code:java}
> .foreachRDD(items -> ...);
> {code}
> This is because the foreachRDD method accepts a Function<..., Void> instead of 
> a VoidFunction<...>. It would make sense to change it to a VoidFunction since, 
> elsewhere in Spark's API, the foreach method already accepts a VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11523) spark_partition_id() considered invalid function

2015-11-04 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-11523:
---

 Summary: spark_partition_id() considered invalid function
 Key: SPARK-11523
 URL: https://issues.apache.org/jira/browse/SPARK-11523
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Simeon Simeonov


{{spark_partition_id()}} works correctly in top-level {{SELECT}} statements but 
is not recognized in {{SELECT}} statements that define views. It seems that DDL 
processing and query execution in Spark SQL use two different parsers and/or 
environments.

In the following examples, instead of the {{test_data}} table you can use any 
defined table name.

A top-level statement works:

{code}
scala> ctx.sql("select spark_partition_id() as partition_id from 
test_data").show
++
|partition_id|
++
|   0|
...
|   0|
++
only showing top 20 rows
{code}

The same query in a view definition fails with {{Invalid function 
'spark_partition_id'}}.

{code}
scala> ctx.sql("create view test_view as select spark_partition_id() as 
partition_id from test_data")
15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
select spark_partition_id() as partition_id from test_data
15/11/05 01:05:38 INFO ParseDriver: Parse Completed
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO ParseDriver: Parsing command: create view test_view as 
select spark_partition_id() as partition_id from test_data
15/11/05 01:05:38 INFO ParseDriver: Parse Completed
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO PerfLogger: 
15/11/05 01:05:38 INFO CalcitePlanner: Starting Semantic Analysis
15/11/05 01:05:38 INFO CalcitePlanner: Creating view default.test_view 
position=12
15/11/05 01:05:38 INFO HiveMetaStore: 0: get_database: default
15/11/05 01:05:38 INFO audit: ugi=sim   ip=unknown-ip-addr  
cmd=get_database: default
15/11/05 01:05:38 INFO CalcitePlanner: Completed phase 1 of Semantic Analysis
15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for source tables
15/11/05 01:05:38 INFO HiveMetaStore: 0: get_table : db=default tbl=test_data
15/11/05 01:05:38 INFO audit: ugi=sim   ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=test_data
15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for subqueries
15/11/05 01:05:38 INFO CalcitePlanner: Get metadata for destination tables
15/11/05 01:05:38 INFO Context: New scratch dir is 
hdfs://localhost:9000/tmp/hive/sim/3fce9b7e-011f-4632-b673-e29067779fa0/hive_2015-11-05_01-05-38_518_4526721093949438849-1
15/11/05 01:05:38 INFO CalcitePlanner: Completed getting MetaData in Semantic 
Analysis
15/11/05 01:05:38 INFO BaseSemanticAnalyzer: Not invoking CBO because the 
statement doesn't have QUERY or EXPLAIN as root and not a CTAS; has create view
15/11/05 01:05:38 ERROR Driver: FAILED: SemanticException [Error 10011]: Line 
1:32 Invalid function 'spark_partition_id'
org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:32 Invalid function 
'spark_partition_id'
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:925)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1265)
at 
org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher.dispatch(DefaultRuleDispatcher.java:90)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:95)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:79)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.walk(DefaultGraphWalker.java:133)
at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:110)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:205)
at 
org.apache.hadoop.hive.ql.parse.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:149)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:10512)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:10468)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3840)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:3619)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:8956)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:8911)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:9756)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genP

[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991208#comment-14991208
 ] 

Apache Spark commented on SPARK-2533:
-

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/9117

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to know how many 
> tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the 
> stage page of web UI. It would be better to show the summary of task locality 
> level in web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10729) word2vec model save for python

2015-11-04 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991206#comment-14991206
 ] 

Yu Ishikawa commented on SPARK-10729:
-

Sorry, the cause isn't `@inherit_doc`. I misunderstood.
Anyway, we should discuss the documentation.

> word2vec model save for python
> --
>
> Key: SPARK-10729
> URL: https://issues.apache.org/jira/browse/SPARK-10729
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Joseph A Gartner III
> Fix For: 1.5.0
>
>
> The ability to save a word2vec model has not been ported to python, and would 
> be extremely useful to have given the long training period.
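For reference, a hedged sketch of the existing Scala API that a Python port would presumably mirror (assuming an active SparkContext {{sc}} and an already trained {{model}}):

{code}
import org.apache.spark.mllib.feature.Word2VecModel

// Persist the trained model so the long training run does not have to be repeated.
model.save(sc, "hdfs:///models/word2vec")
val reloaded: Word2VecModel = Word2VecModel.load(sc, "hdfs:///models/word2vec")
{code}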



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-11-04 Thread Jean-Baptiste Onofré (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991197#comment-14991197
 ] 

Jean-Baptiste Onofré commented on SPARK-2533:
-

New clean PR.

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to know how many 
> tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the 
> stage page of web UI. It would be better to show the summary of task locality 
> level in web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2533) Show summary of locality level of completed tasks in the each stage page of web UI

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991187#comment-14991187
 ] 

Apache Spark commented on SPARK-2533:
-

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/9487

> Show summary of locality level of completed tasks in the each stage page of 
> web UI
> --
>
> Key: SPARK-2533
> URL: https://issues.apache.org/jira/browse/SPARK-2533
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Masayoshi TSUZUKI
>Priority: Minor
>
> When the number of tasks is very large, it is impossible to know how many 
> tasks were executed under (PROCESS_LOCAL/NODE_LOCAL/RACK_LOCAL) from the 
> stage page of web UI. It would be better to show the summary of task locality 
> level in web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-11-04 Thread Jean-Baptiste Onofré (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991160#comment-14991160
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

Hi Phil, I'm testing a fix on Kryo right now, with different use cases. I'll 
keep you posted.
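As background for the cast in the stack trace below, a small illustration (an editor's sketch, not the receiver code): the {{SynchronizedMap}} behaviour lives in an anonymous mixin class, so a map that comes back from serialization as a plain {{HashMap}} can no longer be treated as a {{SynchronizedMap}}, which is consistent with a serializer-related (Kryo) fix.

{code}
import scala.collection.mutable

// A map built with the mixin retains the SynchronizedMap type...
val mixed = new mutable.HashMap[String, String] with mutable.SynchronizedMap[String, String]
println(mixed.isInstanceOf[mutable.SynchronizedMap[_, _]])  // true

// ...but a plain HashMap (e.g. one rebuilt by a serializer that drops the mixin) does not,
// so casting it to SynchronizedMap throws the ClassCastException seen in the report.
val plain = new mutable.HashMap[String, String]
println(plain.isInstanceOf[mutable.SynchronizedMap[_, _]])  // false
{code}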

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11486) TungstenAggregate may fail when switching to sort-based aggregation when there are string in grouping columns and no aggregation buffer columns

2015-11-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11486.

Resolution: Fixed

Issue resolved by pull request 9383
[https://github.com/apache/spark/pull/9383]

> TungstenAggregate may fail when switching to sort-based aggregation when 
> there are string in grouping columns and no aggregation buffer columns
> ---
>
> Key: SPARK-11486
> URL: https://issues.apache.org/jira/browse/SPARK-11486
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.0
>
>
> This was discovered by [~davies]:
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:193)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(generated.java:40)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:643)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:517)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:779)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:128)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:137)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$3.apply(TungstenAggregate.scala:137)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/10/28 23:25:08 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3, 
> runningTasks: 0
> {code}
> See discussion at 
> https://github.com/apache/spark/pull/9383#issuecomment-153466959



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11425) Improve hybrid aggregation (sort-based after hash-based)

2015-11-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11425.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9383
[https://github.com/apache/spark/pull/9383]

> Improve hybrid aggregation (sort-based after hash-based)
> 
>
> Key: SPARK-11425
> URL: https://issues.apache.org/jira/browse/SPARK-11425
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Davies Liu
> Fix For: 1.6.0
>
>
> After aggregation, the dataset could be smaller than the inputs, so it is 
> better to do hash-based aggregation for all inputs and then use sort-based 
> aggregation to merge them.
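A conceptual sketch of that idea (an editor's illustration only, with illustrative names, not the actual Spark implementation): hash-aggregate every input row, spill sorted partial aggregates when the map grows too large, then merge the much smaller spills with a sort-based pass.

{code}
import scala.collection.mutable

// Toy hybrid aggregation over (key, value) pairs with a bounded in-memory hash map.
def hybridAggregate(rows: Iterator[(String, Long)], maxKeys: Int): Seq[(String, Long)] = {
  val spills = mutable.ArrayBuffer.empty[Seq[(String, Long)]]
  var current = mutable.HashMap.empty[String, Long]
  for ((k, v) <- rows) {
    current(k) = current.getOrElse(k, 0L) + v
    if (current.size >= maxKeys) {              // spill the partial aggregates, sorted by key
      spills += current.toSeq.sortBy(_._1)
      current = mutable.HashMap.empty[String, Long]
    }
  }
  spills += current.toSeq.sortBy(_._1)
  // Sort-based merge of the partial aggregates: equal keys become adjacent.
  spills.flatten.sortBy(_._1).foldLeft(List.empty[(String, Long)]) {
    case ((pk, pv) :: tail, (k, v)) if pk == k => (k, pv + v) :: tail
    case (acc, kv) => kv :: acc
  }.reverse
}
{code}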



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-04 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-11500:
-
Description: 
When executing 

{{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, 
pathTwo).printSchema()}}

The order of columns is not deterministic, showing up in a different order 
sometimes.

This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which 
{{ParquetRelation}} extends as you know). When 
{{FileStatusCache.listLeafFiles()}} is called, this returns {{Set[FileStatus]}} 
which messes up the order of {{Array[FileStatus]}}.

So, after retrieving the list of leaf files including {{_metadata}} and 
{{_common_metadata}},  this starts to merge (separately and if necessary) the 
{{Set}} s of {{_metadata}}, {{_common_metadata}} and part-files in 
{{ParquetRelation.mergeSchemasInParallel()}}, which ends up in the different 
column order having the leading columns (of the first file) which the other 
files do not have.

I think this can be resolved by using {{LinkedHashSet}}.



Put simply, if file A has columns 1,2,3 and file B has columns 3,4,5, we cannot 
ensure which columns show up first, since the order is not deterministic.

1. Read the file list (A and B).

2. The order is not deterministic (A and B, or B and A), as described above.

3. The schemas retrieved from (A and B, or B and A) are merged with {{reduceOption}} 
(which maybe should also be {{reduceOptionRight}} or {{reduceOptionLeft}}).

4. The output columns would be 1,2,3,4,5 for A and B, or 3,4,5,1,2 for B and A.





  was:
When executing 

{{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, 
pathTwo).printSchema()}}

The order of columns is not deterministic, showing up in a different order 
sometimes.

This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which 
{{ParquetRelation}} extends as you know). When 
{{FileStatusCache.listLeafFiles()}} is called, this returns {{Set[FileStatus]}} 
which messes up the order of {{Array[FileStatus]}}.

So, after retrieving the list of leaf files including {{_metadata}} and 
{{_common_metadata}},  this starts to merge (separately and if necessary) the 
{{Set}} s of {{_metadata}}, {{_common_metadata}} and part-files in 
{{ParquetRelation.mergeSchemasInParallel()}}, which ends up in the different 
column order having the leading columns (of the first file) which the other 
files do not have.

I think this can be resolved by using {{LinkedHashSet}}.


> Not deterministic order of columns when using merging schemas.
> --
>
> Key: SPARK-11500
> URL: https://issues.apache.org/jira/browse/SPARK-11500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> When executing 
> {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne, 
> pathTwo).printSchema()}}
> The order of columns is not deterministic, showing up in a different order 
> sometimes.
> This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which 
> {{ParquetRelation}} extends as you know). When 
> {{FileStatusCache.listLeafFiles()}} is called, this returns 
> {{Set[FileStatus]}} which messes up the order of {{Array[FileStatus]}}.
> So, after retrieving the list of leaf files including {{_metadata}} and 
> {{_common_metadata}},  this starts to merge (separately and if necessary) the 
> {{Set}} s of {{_metadata}}, {{_common_metadata}} and part-files in 
> {{ParquetRelation.mergeSchemasInParallel()}}, which ends up in the different 
> column order having the leading columns (of the first file) which the other 
> files do not have.
> I think this can be resolved by using {{LinkedHashSet}}.
> Put simply, if file A has columns 1,2,3 and file B has columns 3,4,5, we cannot 
> ensure which columns show up first, since the order is not deterministic.
> 1. Read the file list (A and B).
> 2. The order is not deterministic (A and B, or B and A), as described above.
> 3. The schemas retrieved from (A and B, or B and A) are merged with 
> {{reduceOption}} (which maybe should also be {{reduceOptionRight}} or 
> {{reduceOptionLeft}}).
> 4. The output columns would be 1,2,3,4,5 for A and B, or 3,4,5,1,2 for B and A.
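A small illustration of the proposed fix (an editor's sketch, not Spark code): a plain mutable {{Set}} does not preserve insertion order, while a {{LinkedHashSet}} does.

{code}
import scala.collection.mutable

val files = Seq("part-r-00000", "_metadata", "_common_metadata", "part-r-00001")
val plain  = mutable.Set[String]() ++= files            // arbitrary hash-based order
val linked = mutable.LinkedHashSet[String]() ++= files  // keeps insertion order

println(plain.toSeq)   // order does not follow the order the files were added
println(linked.toSeq)  // List(part-r-00000, _metadata, _common_metadata, part-r-00001)
{code}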



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11514) Pass random seed to spark.ml DecisionTree*

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11514:


Assignee: Apache Spark  (was: Yu Ishikawa)

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-11514
> URL: https://issues.apache.org/jira/browse/SPARK-11514
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11514) Pass random seed to spark.ml DecisionTree*

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991117#comment-14991117
 ] 

Apache Spark commented on SPARK-11514:
--

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/9486

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-11514
> URL: https://issues.apache.org/jira/browse/SPARK-11514
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11514) Pass random seed to spark.ml DecisionTree*

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11514:


Assignee: Yu Ishikawa  (was: Apache Spark)

> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-11514
> URL: https://issues.apache.org/jira/browse/SPARK-11514
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10838) Repeat to join one DataFrame twice,there will be AnalysisException.

2015-11-04 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991108#comment-14991108
 ] 

Xiao Li commented on SPARK-10838:
-

The fix is ready. Writing unit test cases now. 

> Repeat to join one DataFrame twice,there will be AnalysisException.
> ---
>
> Key: SPARK-10838
> URL: https://issues.apache.org/jira/browse/SPARK-10838
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Yun Zhao
>
> The details of the exception are:
> {quote}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) col_a#1 missing from col_a#0,col_b#2,col_a#3,col_b#4 in operator 
> !Join Inner, Some((col_b#2 = col_a#1));
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:908)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:554)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:521)
> {quote}
> The related codes are:
> {quote}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> object DFJoinTest extends App {
>   case class Foo(col_a: String)
>   case class Bar(col_a: String, col_b: String)
>   val sc = new SparkContext(new 
> SparkConf().setMaster("local").setAppName("DFJoinTest"))
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
>   val df1 = sc.parallelize(Array("1")).map(_.split(",")).map(p => 
> Foo(p(0))).toDF()
>   val df2 = sc.parallelize(Array("1,1")).map(_.split(",")).map(p => Bar(p(0), 
> p(1))).toDF()
>   val df3 = df1.join(df2, df1("col_a") === df2("col_a")).select(df1("col_a"), 
> $"col_b")
>   df3.join(df2, df3("col_b") === df2("col_a")).show()
>   //  val df4 = df2.as("df4")
>   //  df3.join(df4, df3("col_b") === df4("col_a")).show()
>   //  df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()
>   sc.stop()
> }
> {quote}
> When using
> {quote}
> val df4 = df2.as("df4")
> df3.join(df4, df3("col_b") === df4("col_a")).show()
> {quote}
> there are errors, but when using
> {quote}
> df3.join(df2.as("df4"), df3("col_b") === $"df4.col_a").show()
> {quote}
> it works normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11522) input_file_name() returns "" for external tables

2015-11-04 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-11522:
---

 Summary: input_file_name() returns "" for external tables
 Key: SPARK-11522
 URL: https://issues.apache.org/jira/browse/SPARK-11522
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Simeon Simeonov


Given an external table definition where the data consists of many CSV files, 
{{input_file_name()}} returns empty strings.

Table definition:

{code}
CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar" = "\"",
   "escapeChar"= "\\"
)  
LOCATION 'file:///Users/sim/spark/test/external_test'
{code}

Query: 

{code}
sql("SELECT input_file_name() as file FROM external_test").show
{code}

Output:

{code}
++
|file|
++
||
||
...
||
++
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11521) LinearRegressionSummary needs to clarify which metrics are weighted

2015-11-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11521:
-

 Summary: LinearRegressionSummary needs to clarify which metrics 
are weighted
 Key: SPARK-11521
 URL: https://issues.apache.org/jira/browse/SPARK-11521
 Project: Spark
  Issue Type: Documentation
  Components: ML
Reporter: Joseph K. Bradley
Priority: Critical


Some metrics in the summary are weighted (e.g., devianceResiduals), but the 
ones computed via RegressionMetrics are not.  This should be documented very 
clearly (unless this gets fixed before the next release in [SPARK-11520]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11520) RegressionMetrics should support instance weights

2015-11-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11520:
-

 Summary: RegressionMetrics should support instance weights
 Key: SPARK-11520
 URL: https://issues.apache.org/jira/browse/SPARK-11520
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


This will be important to improve LinearRegressionSummary, which currently has 
a mix of weighted and unweighted metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11519) Spark MemoryStore with hadoop SequenceFile cache the values is same record.

2015-11-04 Thread xukaiqiang (JIRA)
xukaiqiang created SPARK-11519:
--

 Summary: Spark MemoryStore with hadoop SequenceFile cache the 
values is same record.
 Key: SPARK-11519
 URL: https://issues.apache.org/jira/browse/SPARK-11519
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
 Environment: jdk.1.7.0, spark1.1.0, hadoop2.3.0
Reporter: xukaiqiang


I use Spark's newAPIHadoopFile to read a SequenceFile. When the RDD is cached in 
memory, the cache stores the same Java object for every record.

Reading the Hadoop file with SequenceFileRecordReader produces a NewHadoopRDD whose 
key/value pairs look like:
[1, com.data.analysis.domain.RecordObject@54cdb594]
[2, com.data.analysis.domain.RecordObject@54cdb594]
[3, com.data.analysis.domain.RecordObject@54cdb594]
The values are the same Java object, although I am sure their contents are not the 
same. As soon as the Spark memory cache is used, the MemoryStore vector holds all 
the records, but every value is the last value read from the NewHadoopRDD.
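If the root cause is the usual Hadoop record-object reuse (the RecordReader hands back the same Writable instance for every record), a common workaround is to copy the data out before caching. A hedged sketch, assuming an active SparkContext {{sc}} and using {{Text}} values as a stand-in for the custom {{RecordObject}}:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

val raw = sc.newAPIHadoopFile[LongWritable, Text, SequenceFileInputFormat[LongWritable, Text]](
  "hdfs:///path/to/seqfile")

// Copy each reused Writable into an immutable value before caching; without this,
// every cached element can end up referencing the same (last-read) object.
val safe = raw.map { case (k, v) => (k.get(), v.toString) }.cache()
{code}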



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991022#comment-14991022
 ] 

Apache Spark commented on SPARK-11473:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9485

> R-like summary statistics with intercept for OLS via normal equation solver
> ---
>
> Key: SPARK-11473
> URL: https://issues.apache.org/jira/browse/SPARK-11473
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>
> SPARK-9836 has provided R-like summary statistics for coefficients, we should 
> also add this statistics for intercept.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11473:


Assignee: (was: Apache Spark)

> R-like summary statistics with intercept for OLS via normal equation solver
> ---
>
> Key: SPARK-11473
> URL: https://issues.apache.org/jira/browse/SPARK-11473
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>
> SPARK-9836 has provided R-like summary statistics for coefficients, we should 
> also add this statistics for intercept.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11473) R-like summary statistics with intercept for OLS via normal equation solver

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11473:


Assignee: Apache Spark

> R-like summary statistics with intercept for OLS via normal equation solver
> ---
>
> Key: SPARK-11473
> URL: https://issues.apache.org/jira/browse/SPARK-11473
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> SPARK-9836 has provided R-like summary statistics for coefficients, we should 
> also add this statistics for intercept.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991018#comment-14991018
 ] 

yuhao yang commented on SPARK-9273:
---

Hi [~avulanov]. I've refactored the CNN in 
[https://github.com/hhbyyh/mCNN/tree/master/src/communityInterface] to follow the 
ANN interface. I also made an example in [Driver.scala| 
https://github.com/hhbyyh/mCNN/blob/master/src/communityInterface/Driver.scala] 
to demonstrate how it can be combined with the existing layers to solve MNIST.

Right now I'm trying to optimize the convolution operation, referring to 
[http://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/]. 
I'd appreciate it if you could shed some light :-).

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11518) The script spark-submit.cmd can not handle spark directory with space.

2015-11-04 Thread Cele Liu (JIRA)
Cele Liu created SPARK-11518:


 Summary: The script spark-submit.cmd can not handle spark 
directory with space.
 Key: SPARK-11518
 URL: https://issues.apache.org/jira/browse/SPARK-11518
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Cele Liu


After unzipping Spark into D:\Program Files\Spark and submitting an app, we get 
this error:

'D:\Program' is not recognized as an internal or external command,
operable program or batch file.

In spark-submit.cmd, the script does not handle spaces in the path:
cmd /V /E /C %~dp0spark-submit2.cmd %*
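The usual batch-file remedy (an editor's assumption, not a confirmed Spark patch) is to quote the expanded path so the space in "Program Files" stays inside a single argument, for example:

cmd /V /E /C ""%~dp0spark-submit2.cmd" %*"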



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

2015-11-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990974#comment-14990974
 ] 

yuhao yang commented on SPARK-10809:


working on this.

> Single-document topicDistributions method for LocalLDAModel
> ---
>
> Key: SPARK-10809
> URL: https://issues.apache.org/jira/browse/SPARK-10809
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> We could provide a single-document topicDistributions method for 
> LocalLDAModel to allow for quick queries which avoid RDD operations.  
> Currently, the user must use an RDD of documents.
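For context, a hedged sketch of the current RDD-based usage that the proposed method would shortcut (assuming an active SparkContext {{sc}} and a fitted {{LocalLDAModel}} named {{model}}):

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A single document as a term-count vector over the model's vocabulary.
val doc: Vector = Vectors.sparse(model.vocabSize, Seq((0, 1.0), (3, 2.0)))

// Today the document must be wrapped in an RDD even for a one-off query...
val dist: Vector = model.topicDistributions(sc.parallelize(Seq((0L, doc)))).first()._2

// ...whereas the proposed method would accept the Vector directly, avoiding the RDD round trip.
{code}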



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990975#comment-14990975
 ] 

Apache Spark commented on SPARK-10809:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/9484

> Single-document topicDistributions method for LocalLDAModel
> ---
>
> Key: SPARK-10809
> URL: https://issues.apache.org/jira/browse/SPARK-10809
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> We could provide a single-document topicDistributions method for 
> LocalLDAModel to allow for quick queries which avoid RDD operations.  
> Currently, the user must use an RDD of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10809:


Assignee: (was: Apache Spark)

> Single-document topicDistributions method for LocalLDAModel
> ---
>
> Key: SPARK-10809
> URL: https://issues.apache.org/jira/browse/SPARK-10809
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> We could provide a single-document topicDistributions method for 
> LocalLDAModel to allow for quick queries which avoid RDD operations.  
> Currently, the user must use an RDD of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10809) Single-document topicDistributions method for LocalLDAModel

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10809:


Assignee: Apache Spark

> Single-document topicDistributions method for LocalLDAModel
> ---
>
> Key: SPARK-10809
> URL: https://issues.apache.org/jira/browse/SPARK-10809
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> We could provide a single-document topicDistributions method for 
> LocalLDAModel to allow for quick queries which avoid RDD operations.  
> Currently, the user must use an RDD of documents.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11517) Calc partitions in parallel for multiple partitions table

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990962#comment-14990962
 ] 

Apache Spark commented on SPARK-11517:
--

User 'zhichao-li' has created a pull request for this issue:
https://github.com/apache/spark/pull/9483

> Calc partitions in parallel for multiple partitions table
> -
>
> Key: SPARK-11517
> URL: https://issues.apache.org/jira/browse/SPARK-11517
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
> Currently we calculate getPartitions for each "hive partition" sequentially; it 
> would be faster if we could parallelize this on the driver side. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11517) Calc partitions in parallel for multiple partitions table

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11517:


Assignee: (was: Apache Spark)

> Calc partitions in parallel for multiple partitions table
> -
>
> Key: SPARK-11517
> URL: https://issues.apache.org/jira/browse/SPARK-11517
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhichao-li
>Priority: Minor
>
> Currently we calculate getPartitions for each "hive partition" sequentially; it 
> would be faster if we could parallelize this on the driver side. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11517) Calc partitions in parallel for multiple partitions table

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11517:


Assignee: Apache Spark

> Calc partitions in parallel for multiple partitions table
> -
>
> Key: SPARK-11517
> URL: https://issues.apache.org/jira/browse/SPARK-11517
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: zhichao-li
>Assignee: Apache Spark
>Priority: Minor
>
> Currently we calculate getPartitions for each "hive partition" sequentially; it 
> would be faster if we could parallelize this on the driver side. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11517) Calc partitions in parallel for multiple partitions table

2015-11-04 Thread zhichao-li (JIRA)
zhichao-li created SPARK-11517:
--

 Summary: Calc partitions in parallel for multiple partitions table
 Key: SPARK-11517
 URL: https://issues.apache.org/jira/browse/SPARK-11517
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: zhichao-li
Priority: Minor


Currently we calculate getPartitions for each "hive partition" sequentially; it 
would be faster if we could parallelize this on the driver side. 
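A conceptual sketch of the suggestion (editor's illustration only, with hypothetical helper names, not the actual HadoopRDD/Hive code): compute the per-partition split information with a parallel collection instead of a sequential loop on the driver.

{code}
// Hypothetical stand-ins for Hive partition metadata and split listing.
case class HivePartition(location: String)
def listSplits(p: HivePartition): Seq[String] = Seq(s"${p.location}/part-00000")

val partitions: Seq[HivePartition] =
  Seq(HivePartition("/warehouse/t/dt=2015-11-03"), HivePartition("/warehouse/t/dt=2015-11-04"))

// Sequential today; .par fans the (I/O-bound) metadata calls out across driver threads.
val splits = partitions.par.map(listSplits).seq.flatten
println(splits)
{code}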




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11516) Spark application cannot be found from JSON API even though it exists

2015-11-04 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-11516:
--

 Summary: Spark application cannot be found from JSON API even 
though it exists
 Key: SPARK-11516
 URL: https://issues.apache.org/jira/browse/SPARK-11516
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.1, 1.4.1
Reporter: Matt Cheah


I'm running a Spark standalone cluster on my mac and playing with the JSON API. 
I start both the master and the worker daemons, then start a Spark shell.

When I hit

{code}
http://localhost:8080/api/v1/applications
{code}

I get back

{code}
[ {
  "id" : "app-20151104181347-",
  "name" : "Spark shell",
  "attempts" : [ {
"startTime" : "2015-11-05T02:13:47.980GMT",
"endTime" : "1969-12-31T23:59:59.999GMT",
"sparkUser" : "mcheah",
"completed" : false
  } ]
} ]
{code}

But when I hit

{code}
http://localhost:8080/api/v1/applications/app-20151104181347-/executors
{code}

To look for executor data for the job, I get back

{code}
no such app: app-20151104181347-
{code}

even though the application exists.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11453) append data to partitioned table will messes up the result

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990951#comment-14990951
 ] 

Apache Spark commented on SPARK-11453:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9482

> append data to partitioned table will messes up the result
> --
>
> Key: SPARK-11453
> URL: https://issues.apache.org/jira/browse/SPARK-11453
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10785) Scale QuantileDiscretizer using distributed binning

2015-11-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990948#comment-14990948
 ] 

holdenk commented on SPARK-10785:
-

So looking at the tree work, it looks like it just did a groupByKey for each 
column index, which isn't very useful if we've only got a single column (although 
it's quite possible I misread some of that). I can still do something useful with 
just a single column (just do a sort on the RDD and use the sorted RDD for 
the quantiles) if that sounds like what we are looking for?
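A rough sketch of the sort-based idea in the comment above (editor's illustration, assuming a single non-empty numeric column as an {{RDD[Double]}} named {{col}} and a bin count {{numBins}}):

{code}
val n = col.count()
// Sort once, index the sorted values, and read off approximate quantile boundaries.
val sorted = col.sortBy(identity).zipWithIndex().map(_.swap)   // RDD[(Long, Double)]
// One small lookup job per split boundary; fine as a sketch, not tuned for production.
val splits = (1 until numBins).map { i => sorted.lookup(i * n / numBins).head }
{code}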

> Scale QuantileDiscretizer using distributed binning
> ---
>
> Key: SPARK-10785
> URL: https://issues.apache.org/jira/browse/SPARK-10785
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> [SPARK-10064] improves binning in decision trees by distributing the 
> computation.  QuantileDiscretizer should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10785) Scale QuantileDiscretizer using distributed binning

2015-11-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990930#comment-14990930
 ] 

Joseph K. Bradley commented on SPARK-10785:
---

Yes, we should still sample.

Extending this to multiple input columns is a separate issue; we can do that 
later if needed.

Other than that, this should be analogous to the tree work.

> Scale QuantileDiscretizer using distributed binning
> ---
>
> Key: SPARK-10785
> URL: https://issues.apache.org/jira/browse/SPARK-10785
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> [SPARK-10064] improves binning in decision trees by distributing the 
> computation.  QuantileDiscretizer should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11515) QuantileDiscretizer should take random seed

2015-11-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11515:
-

 Summary: QuantileDiscretizer should take random seed
 Key: SPARK-11515
 URL: https://issues.apache.org/jira/browse/SPARK-11515
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


QuantileDiscretizer takes a random sample to select bins.  It currently does 
not specify a seed for the XORShiftRandom, but it should take a seed by 
extending the HasSeed Param.
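For illustration only (not the eventual implementation), the point of a seed 
Param is reproducible sampling; with the public RDD.sample API the same seed 
always yields the same sample:

{code}
import org.apache.spark.SparkContext

// Same seed => same sample, which is what a HasSeed-style Param would guarantee.
def sampleForBinning(sc: SparkContext, seed: Long): Array[Double] = {
  val col = sc.parallelize(1 to 100000).map(_.toDouble)
  col.sample(withReplacement = false, fraction = 0.01, seed = seed).collect()
}
{code}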



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10745) Separate configs between shuffle and RPC

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10745:


Assignee: Apache Spark

> Separate configs between shuffle and RPC
> 
>
> Key: SPARK-10745
> URL: https://issues.apache.org/jira/browse/SPARK-10745
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> SPARK-6028 uses the network module to implement RPC. However, there are some 
> configurations named with the `spark.shuffle` prefix in the network module. We 
> should refactor them and make sure the user can control them in shuffle and 
> RPC separately.
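For context, a couple of the existing shuffle-prefixed keys that flow into the 
shared network module (documented Spark settings, shown only to illustrate the 
naming overlap this issue wants to untangle):

{code}
import org.apache.spark.SparkConf

// These `spark.shuffle.io.*` keys configure the network layer that, per this issue,
// is currently shared by shuffle and RPC.
val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "6")
  .set("spark.shuffle.io.retryWait", "10s")
{code}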



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10745) Separate configs between shuffle and RPC

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10745:


Assignee: (was: Apache Spark)

> Separate configs between shuffle and RPC
> 
>
> Key: SPARK-10745
> URL: https://issues.apache.org/jira/browse/SPARK-10745
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> SPARK-6028 uses the network module to implement RPC. However, there are some 
> configurations named with the `spark.shuffle` prefix in the network module. We 
> should refactor them and make sure the user can control them in shuffle and 
> RPC separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10745) Separate configs between shuffle and RPC

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990926#comment-14990926
 ] 

Apache Spark commented on SPARK-10745:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9481

> Separate configs between shuffle and RPC
> 
>
> Key: SPARK-10745
> URL: https://issues.apache.org/jira/browse/SPARK-10745
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>
> SPARK-6028 uses the network module to implement RPC. However, there are some 
> configurations named with the `spark.shuffle` prefix in the network module. We 
> should refactor them and make sure the user can control them in shuffle and 
> RPC separately.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7425) spark.ml Predictor should support other numeric types for label

2015-11-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990917#comment-14990917
 ] 

Joseph K. Bradley commented on SPARK-7425:
--

The VectorUDT usage for features should be a separate issue from this JIRA 
(which is for the label).

> spark.ml Predictor should support other numeric types for label
> ---
>
> Key: SPARK-7425
> URL: https://issues.apache.org/jira/browse/SPARK-7425
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>  Labels: starter
>
> Currently, the Predictor abstraction expects the input labelCol type to be 
> DoubleType, but we should support other numeric types.  This will involve 
> updating the PredictorParams.validateAndTransformSchema method.
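Until that lands, a hedged user-side workaround sketch (not the fix itself) is 
to cast the label column to DoubleType before fitting:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast a non-Double numeric label (e.g. IntegerType) to DoubleType before fitting a Predictor.
def withDoubleLabel(df: DataFrame, labelCol: String = "label"): DataFrame =
  df.withColumn(labelCol, col(labelCol).cast(DoubleType))
{code}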



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9465) Could not read parquet table after recreating it with the same table name

2015-11-04 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-9465:
--
Comment: was deleted

(was: I cannot reproduce the issue on 1.5.1 or 1.6.0.

{code}
scala> sqlContext.sql("create table parquet_table stored as parquet as select 
1")
scala> sqlContext.sql("select * from parquet_table").show
+---+
|_c0|
+---+
|  1|
+---+
scala> val df = sqlContext.sql("select * from parquet_a")
scala> df.show
+---+
|_c0|
+---+
|  2|
+---+
scala> df.write.mode(SaveMode.Overwrite).saveAsTable("parquet_table")
scala> sqlContext.sql("select * from parquet_table").show
+---+
|_c0|
+---+
|  2|
+---+

{code}
The data from `parquet_a` overwrites the data in `parquet_table`)

> Could not read parquet table after recreating it with the same table name
> -
>
> Key: SPARK-9465
> URL: https://issues.apache.org/jira/browse/SPARK-9465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: StanZhai
>
> I'm using Spark SQL in Spark 1.4.1. I encounter an error when using a parquet 
> table after recreating it; the error can be reproduced as follows: 
> {code}
> // hc is an instance of HiveContext 
> hc.sql("select * from b").show() // this is ok and b is a parquet 
> table 
> val df = hc.sql("select * from a") 
> df.write.mode(SaveMode.Overwrite).saveAsTable("b") 
> hc.sql("select * from b").show() // got error 
> {code}
> The error is: 
> {code}
> java.io.FileNotFoundException: File does not exist: 
> /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) 
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) 
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
> at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>  
> at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>  
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) 
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218)
>  
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214)
>  
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214)
>  
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:20

[jira] [Commented] (SPARK-9465) Could not read parquet table after recreating it with the same table name

2015-11-04 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990909#comment-14990909
 ] 

Xin Wu commented on SPARK-9465:
---

I cannot reproduce the issue on 1.5.1 or 1.6.0.

{code}
scala> sqlContext.sql("create table parquet_table stored as parquet as select 
1")
scala> sqlContext.sql("select * from parquet_table").show
+---+
|_c0|
+---+
|  1|
+---+
scala> val df = sqlContext.sql("select * from parquet_a")
scala> df.show
+---+
|_c0|
+---+
|  2|
+---+
scala> df.write.mode(SaveMode.Overwrite).saveAsTable("parquet_table")
scala> sqlContext.sql("select * from parquet_table").show
+---+
|_c0|
+---+
|  2|
+---+

{code}
The data from `parquet_a` overwrites the data in `parquet_table`

> Could not read parquet table after recreating it with the same table name
> -
>
> Key: SPARK-9465
> URL: https://issues.apache.org/jira/browse/SPARK-9465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: StanZhai
>
> I'm using Spark SQL in Spark 1.4.1. I encounter an error when using a parquet 
> table after recreating it; the error can be reproduced as follows: 
> {code}
> // hc is an instance of HiveContext 
> hc.sql("select * from b").show() // this is ok and b is a parquet 
> table 
> val df = hc.sql("select * from a") 
> df.write.mode(SaveMode.Overwrite).saveAsTable("b") 
> hc.sql("select * from b").show() // got error 
> {code}
> The error is: 
> {code}
> java.io.FileNotFoundException: File does not exist: 
> /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) 
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) 
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
> at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>  
> at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>  
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132) 
> at 
> org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182) 
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218)
>  
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214)
>  
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>  
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214)
>  
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(Dist

[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml RandomForest findSplitsBins

2015-11-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990899#comment-14990899
 ] 

Joseph K. Bradley commented on SPARK-9722:
--

Great, thank you!

> Pass random seed to spark.ml RandomForest findSplitsBins
> 
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10371) Optimize sequential projections

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10371:


Assignee: Apache Spark

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce a DataFrame with columns a, b, c, d, 
> where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). 
> Some UDFs could be expensive. However, if we materialize both c and d, udf_b 
> and udf_c are triggered twice, i.e., the value of c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10371) Optimize sequential projections

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990894#comment-14990894
 ] 

Apache Spark commented on SPARK-10371:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/9480

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce a DataFrame with columns a, b, c, d, 
> where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). 
> Some UDFs could be expensive. However, if we materialize both c and d, udf_b 
> and udf_c are triggered twice, i.e., the value of c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10371) Optimize sequential projections

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10371:


Assignee: (was: Apache Spark)

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce a DataFrame with columns a, b, c, d, 
> where a comes from the raw input, b = udf_b(a), c = udf_c(b), and d = udf_d(c). 
> Some UDFs could be expensive. However, if we materialize both c and d, udf_b 
> and udf_c are triggered twice, i.e., the value of c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9465) Could not read parquet table after recreating it with the same table name

2015-11-04 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990890#comment-14990890
 ] 

Xin Wu commented on SPARK-9465:
---

I tried on both 1.5.1 and 1.6.0, and I cannot reproduce the issue.

{code}
scala> sqlContext.sql("create table parquet_table stored as parquet as select 
1")
scala> sqlContext.sql("select * from parquet_table").show
+---+
|_c0|
+---+
|  1|
+---+

scala> sqlContext.sql("create table parquet_d stored as parquet as select 10 ")
scala> val df = sqlContext.sql("select * from parquet_d")
scala> df.show
15/11/04 17:14:05 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+---+
|_c0|
+---+
| 10|
+---+
scala> df.write.mode(SaveMode.Overwrite).saveAsTable("parquet_table")
scala> sqlContext.sql("select * from parquet_table").show
15/11/04 17:15:26 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+---+
|_c0|
+---+
| 10|
+---+

{code}

I first created table `parquet_table` with one record of value `1`, then created 
`parquet_d` with one record of value `10`, and overwrote `parquet_table` with 
the DataFrame of `parquet_d`. The select on `parquet_table` returns the correct 
result. 

> Could not read parquet table after recreating it with the same table name
> -
>
> Key: SPARK-9465
> URL: https://issues.apache.org/jira/browse/SPARK-9465
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: StanZhai
>
> I'm using Spark SQL in Spark 1.4.1. I encounter an error when using a parquet 
> table after recreating it; the error can be reproduced as follows: 
> {code}
> // hc is an instance of HiveContext 
> hc.sql("select * from b").show() // this is ok and b is a parquet 
> table 
> val df = hc.sql("select * from a") 
> df.write.mode(SaveMode.Overwrite).saveAsTable("b") 
> hc.sql("select * from b").show() // got error 
> {code}
> The error is: 
> {code}
> java.io.FileNotFoundException: File does not exist: 
> /user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65) 
> at 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55) 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) 
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) 
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>  
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526) 
> at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
>  
> at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
>  
> at 
> org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144) 
> at 
> o

[jira] [Resolved] (SPARK-11307) Reduce memory consumption of OutputCommitCoordinator bookkeeping structures

2015-11-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11307.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9274
[https://github.com/apache/spark/pull/9274]

> Reduce memory consumption of OutputCommitCoordinator bookkeeping structures
> ---
>
> Key: SPARK-11307
> URL: https://issues.apache.org/jira/browse/SPARK-11307
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.0
>
>
> OutputCommitCoordinator uses a map in a place where an array would suffice, 
> increasing its memory consumption for result stages with millions of tasks.
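A generic sketch of the map-to-array idea, assuming dense partition ids 0..n-1 
(illustrative only, not the actual OutputCommitCoordinator code):

{code}
// When keys are dense partition ids, an Array avoids per-entry HashMap overhead.
final class CommitAuthorizations(numPartitions: Int) {
  private val authorizedAttempt = Array.fill(numPartitions)(-1) // -1 = nothing authorized yet

  def canCommit(partition: Int, attemptNumber: Int): Boolean = synchronized {
    if (authorizedAttempt(partition) == -1) {
      authorizedAttempt(partition) = attemptNumber
      true
    } else {
      authorizedAttempt(partition) == attemptNumber
    }
  }
}
{code}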



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11398) unnecessary def dialectClassName in HiveContext, and misleading dialect conf at the start of spark-sql

2015-11-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11398.

   Resolution: Fixed
Fix Version/s: 1.6.0

> unnecessary def dialectClassName in HiveContext, and misleading dialect conf 
> at the start of spark-sql
> --
>
> Key: SPARK-11398
> URL: https://issues.apache.org/jira/browse/SPARK-11398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Zhenhua Wang
>Priority: Minor
> Fix For: 1.6.0
>
>
> 1. def dialectClassName in HiveContext is unnecessary. 
> In HiveContext, if conf.dialect == "hiveql", getSQLDialect() will return new 
> HiveQLDialect(this);
> else it will use super.getSQLDialect(). Then in super.getSQLDialect(), it 
> calls dialectClassName, which is overridden in HiveContext and still returns 
> super.dialectClassName.
> So we'll never reach the code "classOf[HiveQLDialect].getCanonicalName" of 
> def dialectClassName in HiveContext.
> 2. When we start bin/spark-sql, the default context is HiveContext, and the 
> corresponding dialect is hiveql.
> However, if we type "set spark.sql.dialect;", the result is "sql", which is 
> inconsistent with the actual dialect and is misleading. For example, we can 
> use sql like "create table" which is only allowed in hiveql, but this dialect 
> conf shows it's "sql".
> Although this problem will not cause any execution error, it's misleading to 
> Spark SQL users. Therefore I think we should fix it.
> In this PR, instead of overriding def dialect in conf of HiveContext, I set 
> the SQLConf.DIALECT directly as "hiveql", such that result of "set 
> spark.sql.dialect;" will be "hiveql", not "sql". After the change, we can 
> still use "sql" as the dialect in HiveContext through "set 
> spark.sql.dialect=sql". Then the conf.dialect in HiveContext will become sql. 
> Because in SQLConf, def dialect = getConf(), and now the dialect in 
> "settings" becomes "sql".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml RandomForest findSplitsBins

2015-11-04 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990877#comment-14990877
 ] 

Yu Ishikawa commented on SPARK-9722:


[~josephkb] I'll add a seed Param to {{DecisionTreeClassifier}} and 
{{DecisionTreeRegressor}}.

> Pass random seed to spark.ml RandomForest findSplitsBins
> 
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990868#comment-14990868
 ] 

Cheng Hao commented on SPARK-11512:
---

We need to support buckets in the Data Source API.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Sort-merge join on two datasets on the file system that have already been 
> partitioned the same way, with the same number of partitions and sorted within 
> each partition, so we don't need to sort again when joining on the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990867#comment-14990867
 ] 

Cheng Hao commented on SPARK-11512:
---

Oh, yes, but SPARK-5292 is only about supporting Hive buckets; more generically, 
we need to add bucket support to the Data Source API. Anyway, I will link that 
JIRA issue.

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Sort-merge join on two datasets on the file system that have already been 
> partitioned the same way, with the same number of partitions and sorted within 
> each partition, so we don't need to sort again when joining on the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9722) Pass random seed to spark.ml RandomForest findSplitsBins

2015-11-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9722:
-
Summary: Pass random seed to spark.ml RandomForest findSplitsBins  (was: 
Pass random seed to spark.ml DecisionTree*)

> Pass random seed to spark.ml RandomForest findSplitsBins
> 
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9722) Pass random seed to spark.ml DecisionTree*

2015-11-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990866#comment-14990866
 ] 

Joseph K. Bradley commented on SPARK-9722:
--

[~yuu.ishik...@gmail.com] Thanks for the PR!  Sorry I was slow to get to it.  
Could you please add a seed Param to DecisionTreeClassifier and 
DecisionTreeRegressor?  I'll create a new JIRA and link it here.

I hope we can squeeze it into 1.6.  If we can't, then we should check to make 
sure XORShiftRandom behaves nicely when given a seed of 0 (which is the 
behavior after this PR's JIRA).


> Pass random seed to spark.ml DecisionTree*
> --
>
> Key: SPARK-9722
> URL: https://issues.apache.org/jira/browse/SPARK-9722
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Trees use XORShiftRandom when binning continuous features.  Currently, they 
> use a fixed seed of 1.  They should accept a random seed param and use that 
> instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11514) Pass random seed to spark.ml DecisionTree*

2015-11-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11514:
-

 Summary: Pass random seed to spark.ml DecisionTree*
 Key: SPARK-11514
 URL: https://issues.apache.org/jira/browse/SPARK-11514
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Yu Ishikawa






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11512) Bucket Join

2015-11-04 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990861#comment-14990861
 ] 

Marcelo Vanzin commented on SPARK-11512:


Isn't this the same as in SPARK-5292?

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>
> Sort-merge join on two datasets on the file system that have already been 
> partitioned the same way, with the same number of partitions and sorted within 
> each partition, so we don't need to sort again when joining on the 
> sorted/partitioned keys
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11491) Use Scala 2.10.5

2015-11-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11491.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Use Scala 2.10.5
> 
>
> Key: SPARK-11491
> URL: https://issues.apache.org/jira/browse/SPARK-11491
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
> Fix For: 1.6.0
>
>
> Spark should build against Scala 2.10.5, since that includes a fix for 
> Scaladoc: https://issues.scala-lang.org/browse/SI-8479



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11513:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove the internal implicit conversion from LogicalPlan to DataFrame
> -
>
> Key: SPARK-11513
> URL: https://issues.apache.org/jira/browse/SPARK-11513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> DataFrame has an internal implicit conversion that turns a LogicalPlan into a 
> DataFrame. This has been fairly confusing to a few new contributors. Since it 
> doesn't buy us much, we should just remove that implicit conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11513:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove the internal implicit conversion from LogicalPlan to DataFrame
> -
>
> Key: SPARK-11513
> URL: https://issues.apache.org/jira/browse/SPARK-11513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> DataFrame has an internal implicit conversion that turns a LogicalPlan into a 
> DataFrame. This has been fairly confusing to a few new contributors. Since it 
> doesn't buy us much, we should just remove that implicit conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990853#comment-14990853
 ] 

Apache Spark commented on SPARK-11513:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9479

> Remove the internal implicit conversion from LogicalPlan to DataFrame
> -
>
> Key: SPARK-11513
> URL: https://issues.apache.org/jira/browse/SPARK-11513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> DataFrame has an internal implicit conversion that turns a LogicalPlan into a 
> DataFrame. This has been fairly confusing to a few new contributors. Since it 
> doesn't buy us much, we should just remove that implicit conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11513) Remove the internal implicit conversion from LogicalPlan to DataFrame

2015-11-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11513:
---

 Summary: Remove the internal implicit conversion from LogicalPlan 
to DataFrame
 Key: SPARK-11513
 URL: https://issues.apache.org/jira/browse/SPARK-11513
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


DataFrame has an internal implicit conversion that turns a LogicalPlan into a 
DataFrame. This has been fairly confusing to a few new contributors. Since it 
doesn't buy us much, we should just remove that implicit conversion.
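A generic Scala illustration of the pattern being removed, with hypothetical 
types standing in for LogicalPlan/DataFrame (not Spark's actual code):

{code}
import scala.language.implicitConversions

case class Plan(op: String)
case class Frame(plan: Plan)

object ImplicitStyle {
  // The conversion is easy to miss when reading call sites.
  implicit def planToFrame(p: Plan): Frame = Frame(p)
  def demo: Frame = Plan("scan") // silently wrapped
}

object ExplicitStyle {
  def demo: Frame = Frame(Plan("scan")) // explicit wrapping is clearer to new contributors
}
{code}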




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-11-04 Thread Abhishek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990851#comment-14990851
 ] 

Abhishek commented on SPARK-10309:
--

Is there any workaround for this issue? We migrated from 1.1 to 1.5 and our 
jobs depend heavily on joins. I have been trying to get rid of this exception, 
but no luck.

Can someone at least point to where in the code the issue might be? I tried 
doing a few joins on DataFrames instead of the SQL context, but that also 
didn't help. The job only succeeds occasionally (maybe 5% of the time).

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> *=== Update ===*
> This is caused by a mismatch between 
> `Runtime.getRuntime.availableProcessors()` and the number of active tasks in 
> `ShuffleMemoryManager`. A quick reproduction is the following:
> {code}
> // My machine only has 8 cores
> $ bin/spark-shell --master local[32]
> scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
> scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()
> Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:120)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$2.apply(sort.scala:143)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.prepare(MapPartitionsWithPreparationRDD.scala:50)
> {code}
> *=== Original ===*
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.



--
This m

[jira] [Created] (SPARK-11512) Bucket Join

2015-11-04 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-11512:
-

 Summary: Bucket Join
 Key: SPARK-11512
 URL: https://issues.apache.org/jira/browse/SPARK-11512
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao


Sort-merge join on two datasets on the file system that have already been 
partitioned the same way, with the same number of partitions and sorted within 
each partition, so we don't need to sort again when joining on the 
sorted/partitioned keys (a minimal merge sketch follows the list below).

This functionality exists in
- Hive (hive.optimize.bucketmapjoin.sortedmerge)
- Pig (USING 'merge')
- MapReduce (CompositeInputFormat)
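As referenced above, here is a minimal, self-contained sketch of the merge step 
this would exploit (plain Scala over in-memory sequences, not Spark's join 
operators): two inputs already sorted by key can be joined in one pass without 
re-sorting.

{code}
import scala.collection.mutable.ArrayBuffer

def sortMergeJoin[K: Ordering, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] = {
  val ord = implicitly[Ordering[K]]
  val out = ArrayBuffer.empty[(K, (A, B))]
  var i = 0
  var j = 0
  while (i < left.length && j < right.length) {
    val c = ord.compare(left(i)._1, right(j)._1)
    if (c < 0) i += 1
    else if (c > 0) j += 1
    else {
      // Emit the cross product of the matching runs (keys may repeat on either side).
      val k = left(i)._1
      val leftRun = left.drop(i).takeWhile(p => ord.equiv(p._1, k))
      val rightRun = right.drop(j).takeWhile(p => ord.equiv(p._1, k))
      for (a <- leftRun; b <- rightRun) out += ((k, (a._2, b._2)))
      i += leftRun.length
      j += rightRun.length
    }
  }
  out
}
{code}

For example, sortMergeJoin(Seq(1 -> "a", 2 -> "b"), Seq(2 -> "x")) yields 
Seq((2, ("b", "x"))).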




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11510) Remove some SQL aggregation tests

2015-11-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11510.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9475
[https://github.com/apache/spark/pull/9475]

> Remove some SQL aggregation tests
> -
>
> Key: SPARK-11510
> URL: https://issues.apache.org/jira/browse/SPARK-11510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> We have some aggregate function tests in both DataFrameAggregateSuite and 
> SQLQuerySuite. The two have almost the same coverage and we should just 
> remove the SQL one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10387) Code generation for decision tree

2015-11-04 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990809#comment-14990809
 ] 

holdenk commented on SPARK-10387:
-

Progress - although I'm a little uncertain what the best API is for this. 
I'm thinking that we need to choose one of these:
1) Pick a threshold number of trees above which we skip codegen
2) Make #1 configurable through the Spark context
3) Provide an explicit codeGen or toCodeGen method on the model which the user 
can call.
4) Something else entirely
What are people's thoughts? I'm probably going to proceed with #1 for now and 
just try to get something working end to end as a starting point, but I'd 
appreciate people's thoughts on how best to expose this functionality.

> Code generation for decision tree
> -
>
> Key: SPARK-10387
> URL: https://issues.apache.org/jira/browse/SPARK-10387
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: DB Tsai
>
> Provide code generation for decision tree and tree ensembles. Let's first 
> discuss the design and then create new JIRAs for tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6001) K-Means clusterer should return the assignments of input points to clusters

2015-11-04 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-6001.
--
   Resolution: Fixed
 Assignee: Yu Ishikawa
Fix Version/s: 1.5.0

Yep, thanks for pinging!

> K-Means clusterer should return the assignments of input points to clusters
> ---
>
> Key: SPARK-6001
> URL: https://issues.apache.org/jira/browse/SPARK-6001
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.1
>Reporter: Derrick Burns
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 1.5.0
>
>
> The K-Means clusterer returns a KMeansModel that contains the cluster 
> centers. However, I suggest that the K-Means clusterer also return an RDD of 
> the assignments of the input data to the clusters when they are available. 
> While the assignments can be computed given the KMeansModel, returning them 
> when already available saves re-computation costs.
> The K-means implementation at 
> https://github.com/derrickburns/generalized-kmeans-clustering returns the 
> assignments when available.
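In the meantime, the assignments can be obtained from the returned model with 
the public MLlib API; a small hedged example (a workaround, not the change 
requested above):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

def clusterWithAssignments(sc: SparkContext): Unit = {
  val points = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
  val model = KMeans.train(points, k = 2, maxIterations = 10)
  val assignments = model.predict(points) // RDD[Int]: cluster index for each input point
  points.zip(assignments).collect().foreach(println)
}
{code}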



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7332) RpcCallContext.sender has a different name from the original sender's name

2015-11-04 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu closed SPARK-7332.
---
Resolution: Won't Fix

They are internal APIs and not exposed to the user.

> RpcCallContext.sender has a different name from the original sender's name
> --
>
> Key: SPARK-7332
> URL: https://issues.apache.org/jira/browse/SPARK-7332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Qiping Li
>Assignee: Shixiong Zhu
>Priority: Critical
>
> In the function {{receiveAndReply}} of {{RpcEndpoint}}, we get the sender of 
> the received message through {{context.sender}}. But this doesn't work 
> because we don't get the right {{RpcEndpointRef}}: its name is different 
> from the original sender's name, so the path is different.
> Here is the code to test it:
> {code}
> case class Greeting(who: String)
> class GreetingActor(override val rpcEnv: RpcEnv) extends RpcEndpoint with 
> Logging {
>   override def receiveAndReply(context: RpcCallContext) : 
> PartialFunction[Any, Unit] = {
> case Greeting(who) =>
>   logInfo("Hello " + who)
>   logInfo(s"${context.sender.name}")
>   }
> }
> class ToSend(override val rpcEnv: RpcEnv, greeting: RpcEndpointRef) extends 
> RpcEndpoint with Logging {
>   override def onStart(): Unit = {
> logInfo(s"${self.name}")
> greeting.ask(Greeting("Charlie Parker"))
>   }
> }
> object RpcEndpointNameTest {
>   def main(args: Array[String]): Unit = {
> val actorSystemName = "driver"
> val conf = new SparkConf
> val rpcEnv = RpcEnv.create(actorSystemName, "localhost", 0, conf, new 
> SecurityManager(conf))
> val greeter = rpcEnv.setupEndpoint("greeter", new GreetingActor(rpcEnv))
> rpcEnv.setupEndpoint("toSend", new ToSend(rpcEnv, greeter))
>   }
> }
> {code}
> The result was:
> {code}
> toSend
> Hello Charlie Parker
> $a
> {code}
> I test the above code using akka with the following code:
> {code}
> case class Greeting(who: String)
> class GreetingActor extends Actor with ActorLogging {
>   def receive = {
> case Greeting(who) =>
>   println("Hello " + who)
>   println(s"${sender.path} ${sender.path.name}")
>   }
> }
> class ToSend(greeting: ActorRef) extends Actor with ActorLogging {
>   override def preStart(): Unit = {
> println(s"${self.path} ${self.path.name}")
> greeting ! Greeting("Charlie Parker")
>   }
>   def receive = {
> case _ =>
>   log.info("here")
>   }
> }
> object HelloWorld {
>   def main(args: Array[String]): Unit = {
> val system = ActorSystem("MySystem")
> val greeter = system.actorOf(Props[GreetingActor], name = "greeter")
> println(s"${greeter.path} ${greeter.path.name}")
> val system2 = ActorSystem("MySystem2")
> system2.actorOf(Props(classOf[ToSend], greeter), name = "toSend_2")
>   }
> }
> {code}
> And the result was:
> {code}
> akka://MySystem/user/greeter greeter
> akka://MySystem2/user/toSend_2 toSend_2
> Hello Charlie Parker
> akka://MySystem2/user/toSend_2 toSend_2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-11-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990795#comment-14990795
 ] 

Reynold Xin commented on SPARK-11303:
-

This made it into 1.5.2.


> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>Assignee: Yanbo Liang
> Fix For: 1.5.2, 1.6.0
>
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11103) Parquet filters push-down may cause exception when schema merging is turned on

2015-11-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990793#comment-14990793
 ] 

Reynold Xin commented on SPARK-11103:
-

I think this was included in 1.5.2.
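
For anyone still on 1.5.1, one possible stopgap (just a sketch, assuming the failure is triggered by the pushed-down predicate on the column that is missing from some files, and that the tables are visible through a HiveContext) is to disable Parquet filter push-down so the predicate is evaluated by Spark after the scan:

{code}
// Hedged workaround sketch for Spark 1.5.x, not the actual fix:
// turn off Parquet filter push-down so the `col2 = 2` predicate is applied by
// Spark after reading, instead of being validated against each file's schema.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-pushdown-workaround"))
val sqlContext = new HiveContext(sc)

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")
sqlContext.sql("select col1 from `table3` where col2 = 2").show()
{code}

Over the Thrift server, the same property can presumably be set with a SET statement before running the query.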


> Parquet filters push-down may cause exception when schema merging is turned on
> --
>
> Key: SPARK-11103
> URL: https://issues.apache.org/jira/browse/SPARK-11103
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Dominic Ricard
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 1.5.2, 1.6.0
>
>
> When evolving a schema in Parquet files, Spark properly exposes all columns 
> found in the different Parquet files, but when trying to query the data it is 
> not possible to apply a filter on a column that is not present in all files.
> To reproduce:
> *SQL:*
> {noformat}
> create table `table1` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=1/' as select 1 as `col1`;
> create table `table2` STORED AS PARQUET LOCATION 
> 'hdfs://:/path/to/table/id=2/' as select 1 as `col1`, 2 as 
> `col2`;
> create table `table3` USING org.apache.spark.sql.parquet OPTIONS (path 
> "hdfs://:/path/to/table");
> select col1 from `table3` where col2 = 2;
> {noformat}
> The last select will output the following Stack Trace:
> {noformat}
> An error occurred when executing the SQL command:
> select col1 from `table3` where col2 = 2
> [Simba][HiveJDBCDriver](500051) ERROR processing query/statement. Error Code: 
> 0, SQL state: TStatus(statusCode:ERROR_STATUS, 
> infoMessages:[*org.apache.hive.service.cli.HiveSQLException:org.apache.spark.SparkException:
>  Job aborted due to stage failure: Task 0 in stage 7212.0 failed 4 times, 
> most recent failure: Lost task 0.3 in stage 7212.0 (TID 138449, 
> 208.92.52.88): java.lang.IllegalArgumentException: Column [col2] was not 
> found in schema!
>   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:55)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:190)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:178)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:160)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:94)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:59)
>   at 
> org.apache.parquet.filter2.predicate.Operators$Eq.accept(Operators.java:180)
>   at 
> org.apache.parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:64)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:59)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:40)
>   at 
> org.apache.parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:126)
>   at 
> org.apache.parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:46)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:160)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:155)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Ta

[jira] [Updated] (SPARK-11303) sample (without replacement) + filter returns wrong results in DataFrame

2015-11-04 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11303:

Fix Version/s: 1.5.2

> sample (without replacement) + filter returns wrong results in DataFrame
> 
>
> Key: SPARK-11303
> URL: https://issues.apache.org/jira/browse/SPARK-11303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: pyspark local mode, linux.
>Reporter: Yuval Tanny
>Assignee: Yanbo Liang
> Fix For: 1.5.2, 1.6.0
>
>
> When sampling and then filtering a DataFrame from Python, we get inconsistent 
> results when not caching the sampled DataFrame. This bug doesn't appear in 
> Spark 1.4.1.
> {code}
> d = sqlContext.createDataFrame(sc.parallelize([[1]] * 50 + [[2]] * 50),['t'])
> d_sampled = d.sample(False, 0.1, 1)
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> d_sampled.cache()
> print d_sampled.count()
> print d_sampled.filter('t = 1').count()
> print d_sampled.filter('t != 1').count()
> {code}
> output:
> {code}
> 14
> 7
> 8
> 14
> 7
> 7
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6521) executors in the same node read local shuffle file

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990791#comment-14990791
 ] 

Apache Spark commented on SPARK-6521:
-

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9478

> executors in the same node read local shuffle file
> --
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>
> In the past, an executor read another executor's shuffle file on the same node 
> over the network. This PR makes executors on the same node read the shuffle file 
> locally in sort-based shuffle, which reduces network transport.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10648) Spark-SQL JDBC fails to set a default precision and scale when they are not defined in an oracle schema.

2015-11-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990785#comment-14990785
 ] 

Yin Huai commented on SPARK-10648:
--

https://github.com/apache/spark/pull/8780#issuecomment-145598968 and 
https://github.com/apache/spark/pull/8780#issuecomment-144541760 have the 
workaround.
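
In short, the approach in those comments is to register a custom JdbcDialect so that Oracle NUMBER columns reported with no precision/scale map to a usable DecimalType. A minimal sketch (the object name and the chosen DecimalType(38, 10) are illustrative, not the exact code from the PR):

{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Illustrative dialect: when the Oracle driver reports NUMBER with precision 0
// (and scale -127), fall back to a fixed DecimalType instead of overflowing.
object OracleUnboundedNumberDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    if (sqlType == Types.NUMERIC && size == 0) Some(DecimalType(38, 10)) else None
  }
}

// Register before the JDBC read so schema inference uses the fallback type.
JdbcDialects.registerDialect(OracleUnboundedNumberDialect)
{code}

With the dialect registered before the JDBC read, the inferred schema should no longer carry the 0 / -127 precision and scale pair.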

> Spark-SQL JDBC fails to set a default precision and scale when they are not 
> defined in an oracle schema.
> 
>
> Key: SPARK-10648
> URL: https://issues.apache.org/jira/browse/SPARK-10648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: using oracle 11g, ojdbc7.jar
>Reporter: Travis Hegner
>
> Using Oracle 11g as a data source with ojdbc7.jar. When importing data into a 
> Scala app, I am getting an exception "Overflowed precision". Sometimes I 
> would get the exception "Unscaled value too large for precision".
> This issue likely affects older versions as well, but this was the version I 
> verified it on.
> I narrowed it down to the fact that the schema detection system was trying to 
> set the precision to 0, and the scale to -127.
> I have a proposed pull request to follow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7542) Support off-heap sort buffer in UnsafeExternalSorter

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990761#comment-14990761
 ] 

Apache Spark commented on SPARK-7542:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9477

> Support off-heap sort buffer in UnsafeExternalSorter
> 
>
> Key: SPARK-7542
> URL: https://issues.apache.org/jira/browse/SPARK-7542
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>
> {{UnsafeExternalSorter}}, introduced in SPARK-7081, uses on-heap {{long[]}} 
> arrays as its sort buffers.  When records are small, the sorting array might 
> be as large as the data pages, so it would be useful to be able to allocate 
> this array off-heap (using our unsafe LongArray).  Unfortunately, we can't 
> currently do this because TimSort calls {{allocate()}} to create data buffers 
> but doesn't call any corresponding cleanup methods to free them.
> We should look into extending TimSort with buffer freeing methods, then 
> consider switching to LongArray in UnsafeShuffleSortDataFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7542) Support off-heap sort buffer in UnsafeExternalSorter

2015-11-04 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-7542:
-

Assignee: Davies Liu

> Support off-heap sort buffer in UnsafeExternalSorter
> 
>
> Key: SPARK-7542
> URL: https://issues.apache.org/jira/browse/SPARK-7542
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>
> {{UnsafeExternalSorter}}, introduced in SPARK-7081, uses on-heap {{long[]}} 
> arrays as its sort buffers.  When records are small, the sorting array might 
> be as large as the data pages, so it would be useful to be able to allocate 
> this array off-heap (using our unsafe LongArray).  Unfortunately, we can't 
> currently do this because TimSort calls {{allocate()}} to create data buffers 
> but doesn't call any corresponding cleanup methods to free them.
> We should look into extending TimSort with buffer freeing methods, then 
> consider switching to LongArray in UnsafeShuffleSortDataFormat.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990746#comment-14990746
 ] 

Andrew Davidson commented on SPARK-11509:
-

I forgot to mention: on my cluster master I was able to run

bin/pyspark --master local[2]

without any problems. I was able to access sc without any issues.



> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, EC2, PySpark
>Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am 
> able to run the Java SparkPi example on the cluster; however, I am not able to 
> run IPython notebooks on the cluster. (I connect using an ssh tunnel.)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook; however:
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3
> ---
> NameError Traceback (most recent call last)
>  in ()
>   1 from pyspark import SparkContext
> > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>   3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2)
>  If I create a SparkContext I get the following Python version mismatch 
> error:
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run IPython notebooks on my local Mac as follows. (By default 
> you would get an error that the driver and workers are using different versions 
> of Python.)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 
> I have spent a lot of time trying to debug the pyspark script; however, I 
> cannot figure out what the problem is.
> Please let me know if there is something I can do to help
> Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990742#comment-14990742
 ] 

Andrew Davidson commented on SPARK-11509:
-

Yes, it appears the showstopper issue I am facing is that the Python versions 
do not match. However, on my local Mac I was able to figure out how to get 
everything to match; that technique does not work on a Spark cluster. I tried a 
lot of hacking but cannot seem to get the versions to match. I plan to 
install Python 3 on all machines; maybe that will work better.

kind regards

Andy

> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, EC2, PySpark
>Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am 
> able to run the Java SparkPi example on the cluster; however, I am not able to 
> run IPython notebooks on the cluster. (I connect using an ssh tunnel.)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook; however:
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3
> ---
> NameError Traceback (most recent call last)
>  in ()
>   1 from pyspark import SparkContext
> > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>   3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2)
>  If I create a SparkContext I get the following Python version mismatch 
> error:
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run IPython notebooks on my local Mac as follows. (By default 
> you would get an error that the driver and workers are using different versions 
> of Python.)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 
> I have spent a lot of time trying to debug the pyspark script; however, I 
> cannot figure out what the problem is.
> Please let me know if there is something I can do to help
> Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10028) Add Python API for PrefixSpan

2015-11-04 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10028.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9469
[https://github.com/apache/spark/pull/9469]

> Add Python API for PrefixSpan
> -
>
> Key: SPARK-10028
> URL: https://issues.apache.org/jira/browse/SPARK-10028
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yu Ishikawa
> Fix For: 1.6.0
>
>
> Add Python API for mllib.fpm.PrefixSpan



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11459) Allow configuring checkpoint dir, filenames

2015-11-04 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990676#comment-14990676
 ] 

Ryan Williams commented on SPARK-11459:
---

I'm mostly interested in saving RDDs to disk with kryo-serde 
([SPARK-11461|https://issues.apache.org/jira/browse/SPARK-11461]).

The existing checkpoint APIs are functionally exactly what I want, but they 
mandate putting a UUID in the directory name and fixing the basename to the RDD 
ID, somewhat unnecessarily.

Letting the user opt in to specifying the path is a simple way to get at the 
functionality that I want without having to do something possibly more invasive 
e.g. for SPARK-11461, and there's not really a danger of it conflicting with 
existing checkpoint usages. It could even be exposed via a different method on 
SparkContext/RDD, if overloading the semantics of {{checkpoint}} is the concern.

Another, orthogonal option I've worked on a little is basically copy/pasting a 
bunch of the checkpointing logic into a Spark package that hangs methods off of 
SparkContext and RDD to do checkpointing with configurable path naming, but 
that's a less-than-ideal use case for Spark packages: it is a bunch of code that's 
already in Spark that I'd have to keep up to date, etc.
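
For context, a quick sketch of the existing public API being discussed (the HDFS path and numbers are illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of today's checkpoint flow: the user picks only the parent directory;
// Spark appends a UUID-named subdirectory and names the files after the RDD id.
val sc = new SparkContext(
  new SparkConf().setAppName("checkpoint-sketch").setMaster("local[2]"))

sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // illustrative path

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()   // mark for checkpointing
rdd.count()        // an action triggers the write under .../<UUID>/rdd-<id>
{code}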

> Allow configuring checkpoint dir, filenames
> ---
>
> Key: SPARK-11459
> URL: https://issues.apache.org/jira/browse/SPARK-11459
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> I frequently want to persist some RDDs to disk and choose the names of the 
> files that they are saved as.
> Currently, the {{RDD.checkpoint}} flow [writes to a directory with a UUID in 
> its 
> name|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2050],
>  and the file is [always named after the RDD's 
> ID|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala#L96].
> Is there any reason not to allow the user to e.g. pass a string to 
> {{RDD.checkpoint}} that will set the location that the RDD is checkpointed to?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11499) Spark History Server UI should respect protocol when doing redirection

2015-11-04 Thread Lukasz Jastrzebski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990674#comment-14990674
 ] 

Lukasz Jastrzebski commented on SPARK-11499:


There is also the https://en.wikipedia.org/wiki/X-Forwarded-For header that the 
History Server could support.
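
For the scheme itself, the companion header X-Forwarded-Proto is what usually carries the original protocol; a hypothetical sketch of honoring it when building the redirect target (not the History Server's actual code):

{code}
import javax.servlet.http.HttpServletRequest

// Illustrative helper: prefer the scheme/host advertised by the reverse proxy
// over the connection-level values when building a redirect Location.
def redirectLocation(request: HttpServletRequest, path: String): String = {
  val proto = Option(request.getHeader("X-Forwarded-Proto")).getOrElse(request.getScheme)
  val host  = Option(request.getHeader("X-Forwarded-Host")).getOrElse(request.getHeader("Host"))
  s"$proto://$host$path"
}
{code}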

> Spark History Server UI should respect protocol when doing redirection
> --
>
> Key: SPARK-11499
> URL: https://issues.apache.org/jira/browse/SPARK-11499
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Lukasz Jastrzebski
>
> Use case:
> The Spark history server is behind a load balancer secured with an SSL certificate;
> unfortunately, clicking on the application link redirects to the http protocol, 
> which may not be exposed by the load balancer. Example flow:
> *   Trying 52.22.220.1...
> * Connected to xxx.yyy.com (52.22.220.1) port 8775 (#0)
> * WARNING: SSL: Certificate type not set, assuming PKCS#12 format.
> * Client certificate: u...@yyy.com
> * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
> * Server certificate: *.yyy.com
> * Server certificate: Entrust Certification Authority - L1K
> * Server certificate: Entrust Root Certification Authority - G2
> > GET /history/20151030-160604-3039174572-5951-22401-0004 HTTP/1.1
> > Host: xxx.yyy.com:8775
> > User-Agent: curl/7.43.0
> > Accept: */*
> >
> < HTTP/1.1 302 Found
> < Location: 
> http://xxx.yyy.com:8775/history/20151030-160604-3039174572-5951-22401-0004
> < Connection: close
> < Server: Jetty(8.y.z-SNAPSHOT)
> <
> * Closing connection 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11511) Creating an InputDStream but not using it throws NPE

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990639#comment-14990639
 ] 

Apache Spark commented on SPARK-11511:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9476

> Creating an InputDStream but not using it throws NPE
> 
>
> Key: SPARK-11511
> URL: https://issues.apache.org/jira/browse/SPARK-11511
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> If an InputDStream is not used, its rememberDuration will be null and 
> DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11511) Creating an InputDStream but not using it throws NPE

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11511:


Assignee: (was: Apache Spark)

> Creating an InputDStream but not using it throws NPE
> 
>
> Key: SPARK-11511
> URL: https://issues.apache.org/jira/browse/SPARK-11511
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>
> If an InputDStream is not used, its rememberDuration will be null and 
> DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11511) Creating an InputDStream but not using it throws NPE

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11511:


Assignee: Apache Spark

> Creating an InputDStream but not using it throws NPE
> 
>
> Key: SPARK-11511
> URL: https://issues.apache.org/jira/browse/SPARK-11511
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> If an InputDStream is not used, its rememberDuration will be null and 
> DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11511) Creating an InputDStream but not using it throws NPE

2015-11-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11511:


 Summary: Creating an InputDStream but not using it throws NPE
 Key: SPARK-11511
 URL: https://issues.apache.org/jira/browse/SPARK-11511
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu


If an InputDStream is not used, its rememberDuration will be null and 
DStreamGraph.getMaxInputStreamRememberDuration will throw an NPE.
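
A minimal sketch of the scenario as I understand it (ports are placeholders; the second stream is created but never wired into any output):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One input stream feeds an output, the other is created and then forgotten.
// The unused stream never gets a rememberDuration, so
// DStreamGraph.getMaxInputStreamRememberDuration trips over the null.
val conf = new SparkConf().setMaster("local[2]").setAppName("unused-input-dstream")
val ssc = new StreamingContext(conf, Seconds(1))

val used   = ssc.socketTextStream("localhost", 9999)
val unused = ssc.socketTextStream("localhost", 9998) // never used downstream

used.print()

ssc.start()
ssc.awaitTerminationOrTimeout(10000)
ssc.stop()
{code}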



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11510) Remove some SQL aggregation tests

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11510:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove some SQL aggregation tests
> -
>
> Key: SPARK-11510
> URL: https://issues.apache.org/jira/browse/SPARK-11510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We have some aggregate function tests in both DataFrameAggregateSuite and 
> SQLQuerySuite. The two have almost the same coverage and we should just 
> remove the SQL one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11510) Remove some SQL aggregation tests

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11510:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove some SQL aggregation tests
> -
>
> Key: SPARK-11510
> URL: https://issues.apache.org/jira/browse/SPARK-11510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We have some aggregate function tests in both DataFrameAggregateSuite and 
> SQLQuerySuite. The two have almost the same coverage and we should just 
> remove the SQL one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11510) Remove some SQL aggregation tests

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990615#comment-14990615
 ] 

Apache Spark commented on SPARK-11510:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9475

> Remove some SQL aggregation tests
> -
>
> Key: SPARK-11510
> URL: https://issues.apache.org/jira/browse/SPARK-11510
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We have some aggregate function tests in both DataFrameAggregateSuite and 
> SQLQuerySuite. The two have almost the same coverage and we should just 
> remove the SQL one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11493) Remove Bitset in BytesToBytesMap

2015-11-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11493.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9452
[https://github.com/apache/spark/pull/9452]

> Remove Bitset in BytesToBytesMap
> 
>
> Key: SPARK-11493
> URL: https://issues.apache.org/jira/browse/SPARK-11493
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> Since we have 4 bytes for the number of records at the beginning of a page, 
> the address cannot be zero, so we do not need the bitset.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11510) Remove some SQL aggregation tests

2015-11-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11510:
---

 Summary: Remove some SQL aggregation tests
 Key: SPARK-11510
 URL: https://issues.apache.org/jira/browse/SPARK-11510
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We have some aggregate function tests in both DataFrameAggregateSuite and 
SQLQuerySuite. The two have almost the same coverage and we should just remove 
the SQL one.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11459) Allow configuring checkpoint dir, filenames

2015-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11459:
--
Priority: Minor  (was: Major)

What's the use case for this? You can already control the directory, but what 
Spark puts in it is an implementation detail you generally don't want to rely on.

> Allow configuring checkpoint dir, filenames
> ---
>
> Key: SPARK-11459
> URL: https://issues.apache.org/jira/browse/SPARK-11459
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> I frequently want to persist some RDDs to disk and choose the names of the 
> files that they are saved as.
> Currently, the {{RDD.checkpoint}} flow [writes to a directory with a UUID in 
> its 
> name|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/SparkContext.scala#L2050],
>  and the file is [always named after the RDD's 
> ID|https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala#L96].
> Is there any reason not to allow the user to e.g. pass a string to 
> {{RDD.checkpoint}} that will set the location that the RDD is checkpointed to?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990600#comment-14990600
 ] 

Sean Owen commented on SPARK-11509:
---

This ultimately means the initialization failed. In this situation you have to 
dig into the logs to see why it did, but that's why sc isn't available.

But I think you ended up finding the reason there: a Python version mismatch, 
right?

> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, EC2, PySpark
>Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am 
> able to run the Java SparkPi example on the cluster; however, I am not able to 
> run IPython notebooks on the cluster. (I connect using an ssh tunnel.)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook; however:
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3
> ---
> NameError Traceback (most recent call last)
>  in ()
>   1 from pyspark import SparkContext
> > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>   3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2)
>  If I create a SparkContext I get the following Python version mismatch 
> error:
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run IPython notebooks on my local Mac as follows. (By default 
> you would get an error that the driver and workers are using different versions 
> of Python.)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 
> I have spent a lot of time trying to debug the pyspark script; however, I 
> cannot figure out what the problem is.
> Please let me know if there is something I can do to help
> Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-11509:
---

 Summary: ipython notebooks do not work on clusters created using 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
 Key: SPARK-11509
 URL: https://issues.apache.org/jira/browse/SPARK-11509
 Project: Spark
  Issue Type: Bug
  Components: Documentation, EC2, PySpark
Affects Versions: 1.5.1
 Environment: AWS cluster
[ec2-user@ip-172-31-29-60 ~]$ uname -a
Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Reporter: Andrew Davidson


I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.

I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am 
able to run the Java SparkPi example on the cluster; however, I am not able to 
run IPython notebooks on the cluster. (I connect using an ssh tunnel.)

According to the 1.5.1 getting started doc 
http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell

The following should work

 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
--no-browser --port=7000" /root/spark/bin/pyspark

I am able to connect to the notebook server and start a notebook; however:

bug 1) the default sparkContext does not exist

from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3

---
NameError Traceback (most recent call last)
 in ()
  1 from pyspark import SparkContext
> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
  3 textFile.take(3)

NameError: name 'sc' is not defined

bug 2)
 If I create a SparkContext I get the following Python version mismatch error:

sc = SparkContext("local", "Simple App")
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)

 File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, 
PySpark cannot run with different minor versions


I am able to run IPython notebooks on my local Mac as follows. (By default you 
would get an error that the driver and workers are using different versions of 
Python.)

$ cat ~/bin/pySparkNotebook.sh
#!/bin/sh 

set -x # turn debugging on
#set +x # turn debugging off

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 


I have spent a lot of time trying to debug the pyspark script; however, I cannot 
figure out what the problem is.

Please let me know if there is something I can do to help

Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10788:


Assignee: Apache Spark

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-11-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10788:


Assignee: (was: Apache Spark)

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features

2015-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14990576#comment-14990576
 ] 

Apache Spark commented on SPARK-10788:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/9474

> Decision Tree duplicates bins for unordered categorical features
> 
>
> Key: SPARK-10788
> URL: https://issues.apache.org/jira/browse/SPARK-10788
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Decision trees in spark.ml (RandomForest.scala) communicate twice as much 
> data as needed for unordered categorical features.  Here's an example.
> Say there are 3 categories A, B, C.  We consider 3 splits:
> * A vs. B, C
> * A, B vs. C
> * A, C vs. B
> Currently, we collect statistics for each of the 6 subsets of categories (3 * 
> 2 = 6).  However, we could instead collect statistics for the 3 subsets on 
> the left-hand side of the 3 possible splits: A and A,B and A,C.  If we also 
> have stats for the entire node, then we can compute the stats for the 3 
> subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = 
> stats(A,B,C) - stats(A)}}.
> We should eliminate these extra bins within the spark.ml implementation since 
> the spark.mllib implementation will be removed before long (and will instead 
> call into spark.ml).
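
To make the pseudomath above concrete, a tiny illustrative sketch (plain arrays stand in for per-bin sufficient statistics; nothing here mirrors the actual RandomForest.scala data structures):

{code}
// Per-subset sufficient statistics as arrays; the right-hand side of a split is
// recovered by elementwise subtraction, e.g. stats(B,C) = stats(A,B,C) - stats(A).
val statsNode = Array(30.0, 12.0) // e.g. (weighted count, label sum) for {A, B, C}
val statsA    = Array(10.0, 3.0)  // stats collected for the left side "A"

val statsBC = statsNode.zip(statsA).map { case (whole, left) => whole - left }
// statsBC == Array(20.0, 9.0): obtained without ever aggregating {B, C} directly
{code}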



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


