[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187768#comment-14187768
 ] 

Cheng Lian commented on SPARK-3683:
---

Actually, a Hive session was illustrated in SPARK-1959, and it seems that Hive 
interprets {{"NULL"}} as a literal string whose content is "NULL" rather than 
as a null value.
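For illustration, a minimal PySpark check of the reported behavior, assuming a 
SparkContext {{sc}} as in the PySpark shell and a Hive table {{some_table}} with a 
nullable string column {{s}} (both names are placeholders):

{code}
# Hypothetical reproduction sketch -- table and column names are placeholders.
from pyspark.sql import HiveContext

hc = HiveContext(sc)
rows = hc.sql("SELECT s FROM some_table WHERE s IS NULL LIMIT 1").collect()
for row in rows:
    # Expected: None; the bug reported below is that row.s is the string 'NULL'.
    print(row.s is None)
{code}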

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get back Row objects in which Hive 
> NULL is not converted into Python None; instead it is kept as the string 
> 'NULL'. It's only an issue with the String type; it works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2215) Multi-way join

2014-10-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187798#comment-14187798
 ] 

Reynold Xin commented on SPARK-2215:


I think a simplified version of the multi-way join would make sense, i.e. one 
that does a multi-way inner-equi-broadcast join.
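For concreteness, the kind of query this would cover, sketched in PySpark against a 
SQLContext {{sqlContext}} (table and column names are made up): three tables inner 
equi-joined on the same key, where the smaller sides could be broadcast.

{code}
# Illustrative only -- records1/2/3 and "key" are placeholder names.
result = sqlContext.sql("""
    SELECT *
    FROM records1 a
    JOIN records2 b ON a.key = b.key
    JOIN records3 c ON a.key = c.key
""")
{code}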

> Multi-way join
> --
>
> Key: SPARK-2215
> URL: https://issues.apache.org/jira/browse/SPARK-2215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> Support multi-way joins (joins across multiple tables) in a single reduce 
> stage when the tables share the same join keys.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-10-28 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187832#comment-14187832
 ] 

zzc commented on SPARK-2468:


Hi Reynold Xin, when will version 1.2 be released? Is there an approximate date?

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It makes multiple copies of the data and context switches between 
> kernel and user space. It also creates unnecessary buffers in the JVM that 
> increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 
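FileChannel.transferTo is a JVM API; as a conceptual illustration of the same 
zero-copy idea (not Spark's code), the Python sketch below uses os.sendfile, which 
likewise has the kernel copy file bytes straight to a socket without a user-space 
detour:

{code}
import os

def send_block(sock, path):
    # sock: a connected TCP socket; path: the shuffle block file to send.
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # os.sendfile copies from the file's page cache to the socket
            # buffer inside the kernel -- no user-space copy of the data.
            offset += os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
{code}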



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-10-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187837#comment-14187837
 ] 

Reynold Xin commented on SPARK-2468:


Take a look here https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It makes multiple copies of the data and context switches between 
> kernel and user space. It also creates unnecessary buffers in the JVM that 
> increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187861#comment-14187861
 ] 

Nicholas Chammas commented on SPARK-3398:
-

So I spun up an Ubuntu server on EC2 and was able to reproduce this issue. For 
some reason, the call to SSH in the [referenced 
line|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L615]
 fails because it can't find the {{pem}} file passed in to {{spark-ec2}}.

Strange. I'm looking into why.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action

2014-10-28 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-3466:
-
Priority: Critical  (was: Major)

> Limit size of results that a driver collects for each action
> 
>
> Key: SPARK-3466
> URL: https://issues.apache.org/jira/browse/SPARK-3466
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Assignee: Davies Liu
>Priority: Critical
>
> Right now, operations like {{collect()}} and {{take()}} can crash the driver 
> with an OOM if they bring back too much data. We should add a 
> {{spark.driver.maxResultSize}} setting (or something like that) that will 
> make the driver abort a job if its result is too big. We can set it to some 
> fraction of the driver's memory by default, or to something like 100 MB.
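If the setting lands under the proposed name, using it from PySpark would look 
roughly like this (the property name and value are tentative, per the description 
above):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("bounded-collect")
        # Proposed setting from this issue; the name and value are tentative.
        .set("spark.driver.maxResultSize", "100m"))
sc = SparkContext(conf=conf)

# With the limit in place, a collect() whose serialized result exceeds the
# configured size would abort the job instead of OOM-ing the driver.
result = sc.parallelize(range(1000)).collect()
{code}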



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-10-28 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187881#comment-14187881
 ] 

zzc commented on SPARK-2468:


thanks

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It makes multiple copies of the data and context switches between 
> kernel and user space. It also creates unnecessary buffers in the JVM that 
> increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4126) Do not set `spark.executor.instances` if not needed (yarn-cluster)

2014-10-28 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4126:


 Summary: Do not set `spark.executor.instances` if not needed 
(yarn-cluster)
 Key: SPARK-4126
 URL: https://issues.apache.org/jira/browse/SPARK-4126
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


In yarn cluster mode, we currently always set `spark.executor.instances` 
regardless of whether this is set by the user. While not a huge deal, this 
prevents us from knowing whether the user did specify a starting number of 
executors.

This is needed in SPARK-3795 to throw the appropriate exception when this is 
set AND dynamic executor allocation is turned on.
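A minimal sketch of the check SPARK-3795 wants to make (not Spark's actual code; 
it only works if {{spark.executor.instances}} stays unset when the user never set 
it):

{code}
def validate_executor_settings(conf):
    # conf: a dict-like view of the user-supplied Spark properties.
    user_set_instances = "spark.executor.instances" in conf
    dynamic_allocation = conf.get("spark.dynamicAllocation.enabled", "false") == "true"
    if user_set_instances and dynamic_allocation:
        raise ValueError("spark.executor.instances must not be set when "
                         "dynamic executor allocation is enabled")
{code}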



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187898#comment-14187898
 ] 

Nicholas Chammas commented on SPARK-3398:
-

I think I've found the issue. It doesn't have anything to do with Ubuntu or 
with {{wait_for_cluster_state}}.

[~michael.griffiths] - Did {{spark-ec2 launch --resume}} and {{spark-ec2 
login}} ultimately work for you to the point where you had a working Spark EC2 
cluster? Or are you not sure if in the end you were able to get a working 
cluster?

What I'm seeing is that the issue is whether the path to the SSH identity 
file is specified relative to the current working directory or as an absolute path.

Do you still see the same issue if you specify the path to the Identity file 
absolutely?

That is:

{code}
# Currently not working
spark-ec2 -i ../my.pem
{code}

{code}
# Should work
spark-ec2 -i ~/my.pem
spark-ec2 -i /home/me/my.pem
{code}

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187900#comment-14187900
 ] 

Nicholas Chammas commented on SPARK-3398:
-

If that fixes it for you, then I think the solution is simple. We just need to 
set {{cwd}} to the user's current working directory in all our calls to 
[{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call].
 Right now it defaults to the {{spark-ec2}} directory, which will be 
problematic if you call {{spark-ec2}} from another directory.
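For reference, passing {{cwd}} to {{subprocess.check_call()}} looks like the sketch 
below (illustrative only; the actual call sites live in {{spark_ec2.py}}, and the 
host name and paths here are placeholders):

{code}
import subprocess

# If the launcher has chdir'ed into the spark-ec2 directory, run ssh from the
# directory the user originally invoked spark-ec2 from, so a relative identity
# file path like ../my.pem still resolves.
original_user_dir = "/home/me/work"   # placeholder for the saved invocation directory
subprocess.check_call(
    ["ssh", "-i", "../my.pem", "root@ec2-host", "true"],
    cwd=original_user_dir)
{code}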

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3904) HQL doesn't support the ConstantObjectInspector to pass into UDFs

2014-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3904.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2762
[https://github.com/apache/spark/pull/2762]

> HQL doesn't support the ConstantObjectInspector to pass into UDFs
> -
>
> Key: SPARK-3904
> URL: https://issues.apache.org/jira/browse/SPARK-3904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
> Fix For: 1.2.0
>
>
> In HQL, we convert all of the data types into normal ObjectInspectors for 
> UDFs. In most cases this works; however, some UDFs actually require the input 
> ObjectInspector to be a ConstantObjectInspector, which causes an 
> exception.
> e.g.
> {panel}
> select named_struct("x", "str") from src limit 1
> {panel}
> It throws an exception like:
> {panel}
> 14/10/10 16:25:17 INFO parse.ParseDriver: Parsing command: select 
> named_struct("x", "str") from src
> 14/10/10 16:25:17 INFO parse.ParseDriver: Parse Completed
> 14/10/10 16:25:17 INFO metastore.HiveMetaStore: 0: get_table : db=default 
> tbl=src
> 14/10/10 16:25:17 INFO HiveMetaStore.audit: ugi=hcheng
> ip=unknown-ip-addr  cmd=get_table : db=default tbl=tmp2 
> 14/10/10 16:25:17 ERROR thriftserver.SparkSQLDriver: Failed in [select 
> named_struct("x", "str") from src]
> org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Even arguments to 
> NAMED_STRUCT must be a constant 
> STRING.org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaStringObjectInspector@2f2dbcfc
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct.initialize(GenericUDFNamedStruct.java:55)
>   at 
> org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:129)
>   at 
> org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:129)
>   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:158)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$6$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:267)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$6$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:260)
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4127) Streaming Linear Regression

2014-10-28 Thread Anant Daksh Asthana (JIRA)
Anant Daksh Asthana created SPARK-4127:
--

 Summary: Streaming Linear Regression
 Key: SPARK-4127
 URL: https://issues.apache.org/jira/browse/SPARK-4127
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Anant Daksh Asthana
Priority: Minor


Create python bindings for Streaming Linear Regression (MLlib).
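A hypothetical sketch of what the requested binding might look like, mirroring the 
Scala streaming linear regression example; the Python class and method names below 
are assumptions, since the binding does not exist yet:

{code}
# Hypothetical API sketch -- not (yet) a real PySpark API; paths are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.regression import LabeledPoint, StreamingLinearRegressionWithSGD

sc = SparkContext(appName="StreamingLinearRegression")
ssc = StreamingContext(sc, 1)   # 1-second batches

def parse(line):
    label, rest = line.split(",", 1)
    return LabeledPoint(float(label), [float(x) for x in rest.split(",")])

training = ssc.textFileStream("hdfs:///user/me/training").map(parse)

model = StreamingLinearRegressionWithSGD(stepSize=0.1, numIterations=50)
model.setInitialWeights([0.0, 0.0, 0.0])
model.trainOn(training)   # update the model as each batch of labeled points arrives

ssc.start()
ssc.awaitTermination()
{code}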



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4127) Streaming Linear Regression

2014-10-28 Thread Anant Daksh Asthana (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187906#comment-14187906
 ] 

Anant Daksh Asthana commented on SPARK-4127:


[~mengxr] [~freeman-lab] I just added this issue. Could you please assign it to 
me?
Thanks

> Streaming Linear Regression
> ---
>
> Key: SPARK-4127
> URL: https://issues.apache.org/jira/browse/SPARK-4127
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Anant Daksh Asthana
>Priority: Minor
>
> Create python bindings for Streaming Linear Regression (MLlib).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4127) Streaming Linear Regression

2014-10-28 Thread Anant Daksh Asthana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anant Daksh Asthana updated SPARK-4127:
---
Description: 
Create python bindings for Streaming Linear Regression (MLlib).
The Mllib file relevant to this issue can be found 
(here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala]

  was:Create python bindings for Streaming Linear Regression (MLlib).


> Streaming Linear Regression
> ---
>
> Key: SPARK-4127
> URL: https://issues.apache.org/jira/browse/SPARK-4127
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Anant Daksh Asthana
>Priority: Minor
>
> Create python bindings for Streaming Linear Regression (MLlib).
> The Mllib file relevant to this issue can be found 
> (here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4127) Streaming Linear Regression

2014-10-28 Thread Anant Daksh Asthana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anant Daksh Asthana updated SPARK-4127:
---
Description: 
Create python bindings for Streaming Linear Regression (MLlib).
The Mllib file relevant to this issue can be found at : 
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala

  was:
Create python bindings for Streaming Linear Regression (MLlib).
The Mllib file relevant to this issue can be found 
(here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala]


> Streaming Linear Regression
> ---
>
> Key: SPARK-4127
> URL: https://issues.apache.org/jira/browse/SPARK-4127
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Reporter: Anant Daksh Asthana
>Priority: Minor
>
> Create python bindings for Streaming Linear Regression (MLlib).
> The Mllib file relevant to this issue can be found at : 
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4113) Python UDF on ArrayType

2014-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4113.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

> Python UDF on ArrayType
> --
>
> Key: SPARK-4113
> URL: https://issues.apache.org/jira/browse/SPARK-4113
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> from Matei:
> I have a table where column c is of type array. However the following 
> set of commands fails:
> sqlContext.registerFunction("py_func", lambda a: len(a))
> %sql select py_func(c) from some_temp
> Error in SQL statement: java.lang.RuntimeException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 252.0 failed 4 times, most recent failure: Lost task 2.3 in stage 252.0 
> (TID 8454, ip-10-0-157-104.us-west-2.compute.internal): 
> net.razorvine.pickle.PickleException: couldn't introspect javabean: 
> java.lang.IllegalArgumentException: wrong number of arguments
> net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.dump(Pickler.java:95)
> The same function works if I select a Row from my table into Python and call 
> it on its third column.
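A minimal self-contained reproduction sketch, assuming a PySpark shell with {{sc}} 
and {{sqlContext}} available (the table, column and function names come from the 
report above):

{code}
from pyspark.sql import Row

# Build a temp table whose column c has an array type.
rdd = sc.parallelize([Row(c=[1, 2, 3]), Row(c=[4, 5])])
sqlContext.inferSchema(rdd).registerTempTable("some_temp")

sqlContext.registerFunction("py_func", lambda a: len(a))
# Fails with the PickleException above until this fix is applied.
print(sqlContext.sql("SELECT py_func(c) FROM some_temp").collect())
{code}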



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187900#comment-14187900
 ] 

Nicholas Chammas edited comment on SPARK-3398 at 10/29/14 2:48 AM:
---

If that fixes it for you, then I think the solution is simple. -We just need to 
set {{cwd}} to the user's current working directory in all our calls to 
[{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call].
 Right now it defaults to the {{spark-ec2}} directory, which will be 
problematic if you call {{spark-ec2}} from another directory.-

We need to fix [how the script gets called 
here|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark-ec2#L22].


was (Author: nchammas):
If that fixes it for you, then I think the solution is simple. We just need to 
set {{cwd}} to the user's current working directory in all our calls to 
[{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call].
 Right now it defaults to the {{spark-ec2}} directory, which will be 
problematic if you call {{spark-ec2}} from another directory.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4120) Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187929#comment-14187929
 ] 

Apache Spark commented on SPARK-4120:
-

User 'ravipesala' has created a pull request for this issue:
https://github.com/apache/spark/pull/2987

> Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not 
> work in SparkSQL
> 
>
> Key: SPARK-4120
> URL: https://issues.apache.org/jira/browse/SPARK-4120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ravindra Pesala
>Assignee: Ravindra Pesala
> Fix For: 1.2.0
>
>
> Queries that join more than two tables, like the one below, do not work. 
> {code}
> sql("SELECT * FROM records1 as a,records2 as b,records3 as c where 
> a.key=b.key and a.key=c.key")
> {code}
> The above query gives following exception.
> {code}
> Exception in thread "main" java.lang.RuntimeException: [1.40] failure: 
> ``UNION'' expected but `,' found
> SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and 
> a.key=c.key
>^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
>   at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4008) Fix "kryo with fold" in KryoSerializerSuite

2014-10-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-4008.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

> Fix "kryo with fold" in KryoSerializerSuite
> ---
>
> Key: SPARK-4008
> URL: https://issues.apache.org/jira/browse/SPARK-4008
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Priority: Minor
>  Labels: unit-test
> Fix For: 1.2.0
>
>
> "kryo with fold" in KryoSerializerSuite is disabled now. It can be fixed by 
> changing the zeroValue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3008) PySpark fails due to zipimport not able to load the assembly jar (/usr/bin/python: No module named pyspark)

2014-10-28 Thread Jai Kumar Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jai Kumar Singh closed SPARK-3008.
--

> PySpark fails due to  zipimport not able to load the assembly jar 
> (/usr/bin/python: No module named pyspark)
> 
>
> Key: SPARK-3008
> URL: https://issues.apache.org/jira/browse/SPARK-3008
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: Assembly Jar 
> target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar
> jar -tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar | wc 
> -l
> 70441
> git sha commit ba28a8fcbc3ba432e7ea4d6f0b535450a6ec96c6
>Reporter: Jai Kumar Singh
>  Labels: pyspark
>
> PySpark is not working. It fails because zipimport is not able to import the 
> assembly jar, because the jar contains more than 65536 files.
> Email chains in this regard are below
> http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccamjob8kcgk0pqiogju6uokceyswcusw3xwd5wrs8ikpmgd2...@mail.gmail.com%3E
> https://mail.python.org/pipermail/python-list/2014-May/671353.html
> Is there any workaround to bypass the issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1442) Add Window function support

2014-10-28 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-1442:
--
Attachment: Window Function.pdf

> Add Window function support
> ---
>
> Key: SPARK-1442
> URL: https://issues.apache.org/jira/browse/SPARK-1442
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Chengxiang Li
> Attachments: Window Function.pdf
>
>
> Similar to Hive, add window function support for Catalyst.
> https://issues.apache.org/jira/browse/HIVE-4197
> https://issues.apache.org/jira/browse/HIVE-896



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions

2014-10-28 Thread Zhang, Liye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187982#comment-14187982
 ] 

Zhang, Liye commented on SPARK-4094:


[SPARK-3625|https://issues.apache.org/jira/browse/SPARK-3625] did something 
similar to this issue, but currently it does not support a case like this:
*rdd0 = sc.makeRDD(...)*
*rdd1 = rdd0.flatmap(...)*
*rdd1.collect()*
*rdd0.checkpoint()*
*rdd1.count()*
Here *rdd0* would not be checkpointed.
In this JIRA, we will always traverse the whole RDD lineage for any RDD 
action, until we encounter RDDs that have already been checkpointed. Since the 
traversal only checks the status of RDDs, it will not have much impact on 
performance.
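A runnable PySpark version of that scenario (the checkpoint directory is a 
placeholder):

{code}
sc.setCheckpointDir("/tmp/spark-checkpoints")   # placeholder path

rdd0 = sc.parallelize(range(10))
rdd1 = rdd0.flatMap(lambda x: [x, x * 2])
rdd1.collect()                 # an action runs before checkpoint() is requested
rdd0.checkpoint()
rdd1.count()                   # another action over the same lineage
print(rdd0.isCheckpointed())   # False in the unsupported case described above
{code}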

> checkpoint should still be available after rdd actions
> --
>
> Key: SPARK-4094
> URL: https://issues.apache.org/jira/browse/SPARK-4094
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Zhang, Liye
>
> rdd.checkpoint() must be called before any action on the RDD; if any other 
> action runs first, the checkpoint will never succeed. Take the following 
> code as an example:
> *rdd = sc.makeRDD(...)*
> *rdd.collect()*
> *rdd.checkpoint()*
> *rdd.count()*
> This RDD would never be checkpointed. But this does not happen with RDD 
> caching: cache() always takes effect on subsequent actions, no matter 
> whether any action ran before cache() was called.
> So rdd.checkpoint() should have the same behavior as rdd.cache().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4128) Create instructions on fully building Spark in Intellij

2014-10-28 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-4128:
--

 Summary: Create instructions on fully building Spark in Intellij
 Key: SPARK-4128
 URL: https://issues.apache.org/jira/browse/SPARK-4128
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Patrick Wendell
Priority: Blocker


With some of our more complicated modules, I'm not sure whether IntelliJ 
correctly understands all source locations. Also, we might need to specify 
some profiles for the build to work directly. We should document clearly how to 
start with vanilla Spark master and get the entire thing building in IntelliJ.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4062) Improve KafkaReceiver to prevent data loss

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188008#comment-14188008
 ] 

Apache Spark commented on SPARK-4062:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/2991

> Improve KafkaReceiver to prevent data loss
> --
>
> Key: SPARK-4062
> URL: https://issues.apache.org/jira/browse/SPARK-4062
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Saisai Shao
> Attachments: RefactoredKafkaReceiver.pdf
>
>
> The current KafkaReceiver has data loss and data re-consuming problems. Here we 
> propose a ReliableKafkaReceiver to improve its reliability and fault 
> tolerance with the power of Spark Streaming's WAL mechanism.
> This is follow-up work to SPARK-3129. The design doc is posted; any comments 
> would be greatly appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4113) Python UDF on ArrayType

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188007#comment-14188007
 ] 

Apache Spark commented on SPARK-4113:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/2973

> Python UDF on ArrayType
> --
>
> Key: SPARK-4113
> URL: https://issues.apache.org/jira/browse/SPARK-4113
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.2.0
>
>
> from Matei:
> I have a table where column c is of type array. However the following 
> set of commands fails:
> sqlContext.registerFunction("py_func", lambda a: len(a))
> %sql select py_func(c) from some_temp
> Error in SQL statement: java.lang.RuntimeException: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 252.0 failed 4 times, most recent failure: Lost task 2.3 in stage 252.0 
> (TID 8454, ip-10-0-157-104.us-west-2.compute.internal): 
> net.razorvine.pickle.PickleException: couldn't introspect javabean: 
> java.lang.IllegalArgumentException: wrong number of arguments
> net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:299)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392)
> net.razorvine.pickle.Pickler.dispatch(Pickler.java:195)
> net.razorvine.pickle.Pickler.save(Pickler.java:125)
> net.razorvine.pickle.Pickler.dump(Pickler.java:95)
> The same function works if I select a Row from my table into Python and call 
> it on its third column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-10-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188012#comment-14188012
 ] 

Xiangrui Meng commented on SPARK-3080:
--

Thanks for confirming the issue! I guess this could be a serialization issue. 
Did you observe any executor loss during the computation or in-memory cached 
RDDs switching to on-disk storage?

[~derenrich] Which public dataset are you using? Could you also let me know all 
the ALS parameters and custom Spark settings you used? Thanks!

[~ilganeli] If you do need to run ALS on the full dataset, I recommend using 
the new ALS implementation at

https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala

It should perform better. But it is not merged yet.

> ArrayIndexOutOfBoundsException in ALS for Large datasets
> 
>
> Key: SPARK-3080
> URL: https://issues.apache.org/jira/browse/SPARK-3080
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Burak Yavuz
>
> The stack trace is below:
> {quote}
> java.lang.ArrayIndexOutOfBoundsException: 2716
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> 
> org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)
> 
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> 
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
> 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> {quote}
> This happened after the dataset was sub-sampled. 
> Dataset properties: ~12B ratings
> Setup: 55 r3.8xlarge ec2 instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer

2014-10-28 Thread DB Tsai (JIRA)
DB Tsai created SPARK-4129:
--

 Summary: Performance tuning in MultivariateOnlineSummarizer
 Key: SPARK-4129
 URL: https://issues.apache.org/jira/browse/SPARK-4129
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: DB Tsai


In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
through the nonZero elements in the vector. However, activeIterator doesn't 
perform well due to lots of overhead. In this PR, a native while loop is used for 
both DenseVector and SparseVector.

The benchmark result with 20 executors using mnist8m dataset:

Before:
DenseVector: 48.2 seconds
SparseVector: 16.3 seconds

After:
DenseVector: 17.8 seconds
SparseVector: 11.2 seconds

Since MultivariateOnlineSummarizer is used in several places, the overall 
performance gain in the MLlib library will be significant with this PR. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188037#comment-14188037
 ] 

Apache Spark commented on SPARK-4129:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/2992

> Performance tuning in MultivariateOnlineSummarizer
> --
>
> Key: SPARK-4129
> URL: https://issues.apache.org/jira/browse/SPARK-4129
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop 
> through the nonZero elements in the vector. However, activeIterator doesn't 
> perform well due to lots of overhead. In this PR, a native while loop is used 
> for both DenseVector and SparseVector.
> The benchmark result with 20 executors using mnist8m dataset:
> Before:
> DenseVector: 48.2 seconds
> SparseVector: 16.3 seconds
> After:
> DenseVector: 17.8 seconds
> SparseVector: 11.2 seconds
> Since MultivariateOnlineSummarizer is used in several places, the overall 
> performance gain in the MLlib library will be significant with this PR. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4122) Add library to write data back to Kafka

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188052#comment-14188052
 ] 

Apache Spark commented on SPARK-4122:
-

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/2994

> Add library to write data back to Kafka
> ---
>
> Key: SPARK-4122
> URL: https://issues.apache.org/jira/browse/SPARK-4122
> Project: Spark
>  Issue Type: Bug
>Reporter: Hari Shreedharan
>Assignee: Hari Shreedharan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4111) [MLlib] Implement regression model evaluation metrics

2014-10-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-4111:
---
Summary: [MLlib] Implement regression model evaluation metrics  (was: 
Implement regression model evaluation metrics)

> [MLlib] Implement regression model evaluation metrics
> -
>
> Key: SPARK-4111
> URL: https://issues.apache.org/jira/browse/SPARK-4111
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Yanbo Liang
>
> Supervised machine learning includes classification and regression. There are 
> classification metrics (BinaryClassificationMetrics) in MLlib; we also need 
> regression metrics to evaluate regression models and tune parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics

2014-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186496#comment-14186496
 ] 

Sean Owen commented on SPARK-4111:
--

Is this more than just MAE / RMSE / R2? It might be handy to have a little 
utility class for these, although they're almost one-liners already.
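For reference, the near-one-liners in PySpark, given an RDD of (prediction, label) 
float pairs (the variable name is illustrative, not a proposed API):

{code}
from math import sqrt

# preds_and_labels: an RDD of (prediction, label) pairs of floats.
n = preds_and_labels.count()
mae = preds_and_labels.map(lambda pl: abs(pl[0] - pl[1])).sum() / n
mse = preds_and_labels.map(lambda pl: (pl[0] - pl[1]) ** 2).sum() / n
rmse = sqrt(mse)

mean_label = preds_and_labels.map(lambda pl: pl[1]).sum() / n
ss_tot = preds_and_labels.map(lambda pl: (pl[1] - mean_label) ** 2).sum()
r2 = 1.0 - (mse * n) / ss_tot
{code}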

> [MLlib] Implement regression model evaluation metrics
> -
>
> Key: SPARK-4111
> URL: https://issues.apache.org/jira/browse/SPARK-4111
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Yanbo Liang
>
> Supervised machine learning includes classification and regression. There are 
> classification metrics (BinaryClassificationMetrics) in MLlib; we also need 
> regression metrics to evaluate regression models and tune parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4115) add overrided count for EdgeRDD

2014-10-28 Thread Lu Lu (JIRA)
Lu Lu created SPARK-4115:


 Summary: add overrided count for EdgeRDD
 Key: SPARK-4115
 URL: https://issues.apache.org/jira/browse/SPARK-4115
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.1.0
Reporter: Lu Lu
Priority: Minor
 Fix For: 1.1.1


Add an overridden count for edge counting in EdgeRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4115) [GraphX] add overrided count for EdgeRDD

2014-10-28 Thread Lu Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-4115:
-
Summary: [GraphX] add overrided count for EdgeRDD  (was: add overrided 
count for EdgeRDD)

> [GraphX] add overrided count for EdgeRDD
> 
>
> Key: SPARK-4115
> URL: https://issues.apache.org/jira/browse/SPARK-4115
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.1.0
>Reporter: Lu Lu
>Priority: Minor
> Fix For: 1.1.1
>
>
> Add an overridden count for edge counting in EdgeRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4107) Incorrect handling of Channel.read()'s return value may lead to data truncation

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186506#comment-14186506
 ] 

Apache Spark commented on SPARK-4107:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2974

> Incorrect handling of Channel.read()'s return value may lead to data 
> truncation
> ---
>
> Key: SPARK-4107
> URL: https://issues.apache.org/jira/browse/SPARK-4107
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>
> When using {{Channel.read()}}, we need to properly handle the return value 
> and account for the case where we've read fewer bytes than expected.  There 
> are a few places where we don't do this properly, which may lead to incorrect 
> data truncation in rare circumstances.  I've opened a PR to fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics

2014-10-28 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186514#comment-14186514
 ] 

Yanbo Liang commented on SPARK-4111:


We have implemented regression metrics such as explained variance score, MAE, MSE 
and R2 score to evaluate regression models.
Without evaluation metrics, users cannot judge how good or bad a model is or 
tune its parameters for better results.
I will submit a PR for this issue.

> [MLlib] Implement regression model evaluation metrics
> -
>
> Key: SPARK-4111
> URL: https://issues.apache.org/jira/browse/SPARK-4111
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Yanbo Liang
>
> Supervised machine learning includes classification and regression. There are 
> classification metrics (BinaryClassificationMetrics) in MLlib; we also need 
> regression metrics to evaluate regression models and tune parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4115) [GraphX] add overrided count for EdgeRDD

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186518#comment-14186518
 ] 

Apache Spark commented on SPARK-4115:
-

User 'luluorta' has created a pull request for this issue:
https://github.com/apache/spark/pull/2975

> [GraphX] add overrided count for EdgeRDD
> 
>
> Key: SPARK-4115
> URL: https://issues.apache.org/jira/browse/SPARK-4115
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.1.0
>Reporter: Lu Lu
>Priority: Minor
> Fix For: 1.1.1
>
>
> Add an overridden count for edge counting in EdgeRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4111) [MLlib] Implement regression model evaluation metrics

2014-10-28 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186514#comment-14186514
 ] 

Yanbo Liang edited comment on SPARK-4111 at 10/28/14 7:37 AM:
--

We have implemented regression metrics such as explained variance score, MAE, MSE 
and R2 score to evaluate regression models.
Without evaluation metrics, users cannot judge how good or bad a model is or 
tune its parameters for better results.
I will submit a PR for this issue. Can I have this assigned to me?


was (Author: yanboliang):
We have implemented regression metrics such as explained variance score, MAE, MSE 
and R2 score to evaluate regression models.
Without evaluation metrics, users cannot judge how good or bad a model is or 
tune its parameters for better results.
I will submit a PR for this issue.

> [MLlib] Implement regression model evaluation metrics
> -
>
> Key: SPARK-4111
> URL: https://issues.apache.org/jira/browse/SPARK-4111
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Yanbo Liang
>
> Supervised machine learning includes classification and regression. There are 
> classification metrics (BinaryClassificationMetrics) in MLlib; we also need 
> regression metrics to evaluate regression models and tune parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

2014-10-28 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186562#comment-14186562
 ] 

Sandy Ryza commented on SPARK-3461:
---

SPARK-2926 could help with this as well.

> Support external groupByKey using repartitionAndSortWithinPartitions
> 
>
> Key: SPARK-3461
> URL: https://issues.apache.org/jira/browse/SPARK-3461
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Davies Liu
>Priority: Critical
>
> Given that we have SPARK-2978, it seems like we could support an external 
> group by operator pretty easily. We'd just have to wrap the existing iterator 
> exposed by SPARK-2978 with a lookahead iterator that detects the group 
> boundaries. Also, we'd have to override the cache() operator to cache the 
> parent RDD so that if this object is cached it doesn't wind through the 
> iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the 
> stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write 
> jobs in terms of groupByKey.
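A minimal sketch of the lookahead-style grouping described above, assuming 
key-sorted partitions such as those produced by 
{{repartitionAndSortWithinPartitions}} (names are illustrative):

{code}
from itertools import groupby
from operator import itemgetter

def group_sorted_partition(iterator):
    # iterator yields (key, value) pairs already sorted by key within the
    # partition; each group's values must be consumed before moving to the next key.
    for key, pairs in groupby(iterator, key=itemgetter(0)):
        yield key, (value for _, value in pairs)

# grouped = (pair_rdd
#            .repartitionAndSortWithinPartitions(numPartitions=8)
#            .mapPartitions(group_sorted_partition, preservesPartitioning=True))
{code}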



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4116) Delete the abandoned log4j-spark-container.properties

2014-10-28 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-4116:
--

 Summary: Delete the abandoned log4j-spark-container.properties
 Key: SPARK-4116
 URL: https://issues.apache.org/jira/browse/SPARK-4116
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: WangTaoTheTonic
Priority: Minor


It seems the properties file has been abandoned; we could delete it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4116) Delete the abandoned log4j-spark-container.properties

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186641#comment-14186641
 ] 

Apache Spark commented on SPARK-4116:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/2977

> Delete the abandoned log4j-spark-container.properties
> -
>
> Key: SPARK-4116
> URL: https://issues.apache.org/jira/browse/SPARK-4116
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: WangTaoTheTonic
>Priority: Minor
>
> Seems like the properties file was abandoned; we could delete it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib

2014-10-28 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186656#comment-14186656
 ] 

Ashutosh Trivedi commented on SPARK-4038:
-

Tagging [~Kaushik619], as he is also working with me on this.

> Outlier Detection Algorithm for MLlib
> -
>
> Key: SPARK-4038
> URL: https://issues.apache.org/jira/browse/SPARK-4038
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Ashutosh Trivedi
>Priority: Minor
>
> The aim of this JIRA is to discuss about which parallel outlier detection 
> algorithms can be included in MLlib. 
> The one which I am familiar with is Attribute Value Frequency (AVF). It 
> scales linearly with the number of data points and attributes, and relies on 
> a single data scan. It is not distance based and is well suited for categorical 
> data. In the original paper a parallel version is also given, which is not 
> complicated to implement. I am working on the implementation and will soon submit 
> the initial code for review.
> Here is the Link for the paper
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382
> As pointed out by Xiangrui in discussion 
> http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html
> There are other algorithms as well. Let's discuss which will be more 
> general and more easily parallelized.
>
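
For reference, a minimal AVF sketch on an RDD of categorical records (illustrative 
only, not the code under review):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Illustrative AVF sketch: a record's score is the sum of the frequencies of its
// attribute values; records with the lowest scores are the most likely outliers.
def avfScores(data: RDD[Seq[String]]): RDD[(Seq[String], Long)] = {
  // Count how often each (attributeIndex, value) pair occurs -- a single data scan.
  val freq = data.flatMap(_.zipWithIndex.map { case (value, idx) => ((idx, value), 1L) })
    .reduceByKey(_ + _)
    .collectAsMap()
  val freqBc = data.sparkContext.broadcast(freq)
  data.map { record =>
    val score = record.zipWithIndex.map { case (value, idx) => freqBc.value((idx, value)) }.sum
    (record, score)
  }
}
{code}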



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2014-10-28 Thread Ashutosh Trivedi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186659#comment-14186659
 ] 

Ashutosh Trivedi commented on SPARK-2335:
-

Thanks [~bgawalt] and [~slcclimber] for helping us out. Looking forward to 
working with you guys. Tagging [~Kaushik619] here, as he is also working with 
me.

We will be giving inputs here soon.


> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features, newbie
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big data strengths: lots 
> of data mitigates data concerns; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.
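
For context, a brute-force sketch of kNN classification on an RDD (illustrative, 
assuming Euclidean distance and majority vote):

{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Illustrative sketch: score every training point against the query, keep the k
// nearest on the driver via takeOrdered, and vote by majority label.
def knnClassify(training: RDD[LabeledPoint], query: Array[Double], k: Int): Double = {
  val distances = training.map { p =>
    val d = math.sqrt(p.features.toArray.zip(query).map { case (a, b) => (a - b) * (a - b) }.sum)
    (d, p.label)
  }
  val nearest = distances.takeOrdered(k)(Ordering.by[(Double, Double), Double](_._1))
  nearest.groupBy(_._2).maxBy(_._2.length)._1
}
{code}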



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3961) Python API for mllib.feature

2014-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3961.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2819
[https://github.com/apache/spark/pull/2819]

> Python API for mllib.feature
> 
>
> Key: SPARK-3961
> URL: https://issues.apache.org/jira/browse/SPARK-3961
> Project: Spark
>  Issue Type: New Feature
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> Add completed API for mllib.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics

2014-10-28 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186710#comment-14186710
 ] 

Yanbo Liang commented on SPARK-4111:


https://github.com/apache/spark/pull/2978

> [MLlib] Implement regression model evaluation metrics
> -
>
> Key: SPARK-4111
> URL: https://issues.apache.org/jira/browse/SPARK-4111
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Yanbo Liang
>
> Supervised machine learning includes classification and regression. There are 
> classification metrics (BinaryClassificationMetrics) in MLlib; we also need 
> regression metrics to evaluate regression models and tune parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186711#comment-14186711
 ] 

Apache Spark commented on SPARK-2352:
-

User 'bgreeven' has created a pull request for this issue:
https://github.com/apache/spark/pull/1290

> [MLLIB] Add Artificial Neural Network (ANN) to Spark
> 
>
> Key: SPARK-2352
> URL: https://issues.apache.org/jira/browse/SPARK-2352
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
> Environment: MLLIB code
>Reporter: Bert Greevenbosch
>Assignee: Bert Greevenbosch
>
> It would be good if the Machine Learning Library contained Artificial Neural 
> Networks (ANNs).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186741#comment-14186741
 ] 

Apache Spark commented on SPARK-4111:
-

User 'yanbohappy' has created a pull request for this issue:
https://github.com/apache/spark/pull/2978

> [MLlib] Implement regression model evaluation metrics
> -
>
> Key: SPARK-4111
> URL: https://issues.apache.org/jira/browse/SPARK-4111
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Yanbo Liang
>
> Supervised machine learning includes classification and regression. There are 
> classification metrics (BinaryClassificationMetrics) in MLlib; we also need 
> regression metrics to evaluate regression models and tune parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4116) Delete the abandoned log4j-spark-container.properties

2014-10-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-4116.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> Delete the abandoned log4j-spark-container.properties
> -
>
> Key: SPARK-4116
> URL: https://issues.apache.org/jira/browse/SPARK-4116
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.2.0
>
>
> Seems like the properties file was abandoned; we could delete it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4095) [YARN][Minor]extract val isLaunchingDriver in ClientBase

2014-10-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-4095.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> [YARN][Minor]extract val isLaunchingDriver in ClientBase
> 
>
> Key: SPARK-4095
> URL: https://issues.apache.org/jira/browse/SPARK-4095
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.2.0
>
>
> Instead of checking if `args.userClass` is null repeatedly, we extract it to 
> a global val as in `ApplicationMaster`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3837) Warn when YARN is killing containers for exceeding memory limits

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186850#comment-14186850
 ] 

Apache Spark commented on SPARK-3837:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/2744

> Warn when YARN is killing containers for exceeding memory limits
> 
>
> Key: SPARK-3837
> URL: https://issues.apache.org/jira/browse/SPARK-3837
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>
> YARN now lets application masters know when it kills their containers for 
> exceeding memory limits.  Spark should log something when this happens.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4117) Spark on Yarn handle AM being told command from RM

2014-10-28 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-4117:


 Summary: Spark on Yarn handle AM being told command from RM
 Key: SPARK-4117
 URL: https://issues.apache.org/jira/browse/SPARK-4117
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves


In the allocateResponse from the RM, it can send commands that the AM should 
follow, for instance AM_RESYNC and AM_SHUTDOWN. We should add support for 
those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4108) Fix uses of @deprecated in catalyst dataTypes

2014-10-28 Thread Anant Daksh Asthana (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anant Daksh Asthana updated SPARK-4108:
---
Component/s: SQL

> Fix uses of @deprecated in catalyst dataTypes
> -
>
> Key: SPARK-4108
> URL: https://issues.apache.org/jira/browse/SPARK-4108
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Anant Daksh Asthana
>Priority: Trivial
>
> @deprecated takes 2 parameters, message and version. 
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala
> has a usage of @deprecated with just one parameter.
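
For reference, a well-formed usage looks like this (message and version strings 
below are illustrative):

{code}
// Scala's @deprecated annotation takes both a deprecation message and a "since" version.
@deprecated("Use the replacement type instead", "1.2.0")
def oldApi(): Unit = ()
{code}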



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3611) Show number of cores for each executor in application web UI

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186909#comment-14186909
 ] 

Apache Spark commented on SPARK-3611:
-

User 'devldevelopment' has created a pull request for this issue:
https://github.com/apache/spark/pull/2980

> Show number of cores for each executor in application web UI
> 
>
> Key: SPARK-3611
> URL: https://issues.apache.org/jira/browse/SPARK-3611
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This number is not always fully known, because e.g. in Mesos your executors 
> can scale up and down in # of CPUs, but it would be nice to show at least the 
> number of cores the machine has in that case, or the # of cores the executor 
> has been configured with if known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-10-28 Thread zzc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186910#comment-14186910
 ] 

zzc commented on SPARK-2468:


Hi Reynold Xin, when can this issue be resolved?
I need to improve shuffle performance as soon as possible.

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffer in the JVM that increases GC
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 
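
For illustration, a minimal zero-copy send using FileChannel.transferTo (a sketch of 
the technique described above, not Spark code):

{code}
import java.io.{File, FileInputStream}
import java.nio.channels.WritableByteChannel

// Illustrative sketch: transferTo hands the file to the target channel inside the
// kernel, avoiding the extra user-space copies described above.
def sendFileZeroCopy(file: File, target: WritableByteChannel): Unit = {
  val channel = new FileInputStream(file).getChannel
  try {
    var position = 0L
    val size = channel.size()
    while (position < size) {
      // transferTo may send fewer bytes than requested, so loop until done.
      position += channel.transferTo(position, size - position, target)
    }
  } finally {
    channel.close()
  }
}
{code}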



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4098) use appUIAddress instead of appUIHostPort in yarn-client mode

2014-10-28 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-4098.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> use appUIAddress instead of appUIHostPort in yarn-client mode
> -
>
> Key: SPARK-4098
> URL: https://issues.apache.org/jira/browse/SPARK-4098
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.2.0
>
>
> I don't understand why appUIHostPort is used here while in yarn-cluster mode we 
> use appUIAddress, so I replaced it. 
> Testing results show it is OK to change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2014-10-28 Thread Kaushik Ranjan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186954#comment-14186954
 ] 

Kaushik Ranjan commented on SPARK-2335:
---

Hi [~bgawalt].

For evaluation of KNN-join, one needs to calculate z-scores of data-points 
within the dataset.

Yu-ISHIKAWA has implemented the following
https://gist.github.com/yu-iskw/37ae208c530f7018e048

Would it be justified to open a New Feature issue to address z-scores?


> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features, newbie
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides -- high variance 
> (sensitivity to the known training data set) and computational intensity for 
> estimating new point labels -- both play to Spark's big data strengths: lots 
> of data mitigates data concerns; lots of workers mitigate computational 
> latency. 
> We should include kNN models as options in MLLib.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark

2014-10-28 Thread Oleg Zhurakousky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186987#comment-14186987
 ] 

Oleg Zhurakousky commented on SPARK-3561:
-

[~vanzin]

I would not call it hard (we did it in the initial POC by simply mixing a 
custom trait into SC - essentially extending it). However, I do agree that a lot 
of Spark's initialization would still happen due to the implementation of SC 
itself, thus creating and initializing some artifacts that may not be 
used with a different execution context. 
Question: why was it done like this and not pushed into some SC.init operation? 

> Allow for pluggable execution contexts in Spark
> ---
>
> Key: SPARK-3561
> URL: https://issues.apache.org/jira/browse/SPARK-3561
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Oleg Zhurakousky
>  Labels: features
> Fix For: 1.2.0
>
> Attachments: SPARK-3561.pdf
>
>
> Currently Spark provides integration with external resource-managers such as 
> Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the 
> current architecture of Spark-on-YARN can be enhanced to provide 
> significantly better utilization of cluster resources for large scale, batch 
> and/or ETL applications when run alongside other applications (Spark and 
> others) and services in YARN. 
> Proposal: 
> The proposed approach would introduce a pluggable JobExecutionContext (trait) 
> - a gateway and a delegate to Hadoop execution environment - as a non-public 
> api (@Experimental) not exposed to end users of Spark. 
> The trait will define 6 operations: 
> * hadoopFile 
> * newAPIHadoopFile 
> * broadcast 
> * runJob 
> * persist
> * unpersist
> Each method directly maps to the corresponding methods in current version of 
> SparkContext. JobExecutionContext implementation will be accessed by 
> SparkContext via master URL as 
> "execution-context:foo.bar.MyJobExecutionContext" with default implementation 
> containing the existing code from SparkContext, thus allowing current 
> (corresponding) methods of SparkContext to delegate to such implementation. 
> An integrator will now have the option to provide a custom implementation of 
> DefaultExecutionContext by either implementing it from scratch or extending 
> from DefaultExecutionContext. 
> Please see the attached design doc for more details. 
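
A rough sketch of what such a trait could look like (signatures here are illustrative; 
see the attached design doc for the actual proposal):

{code}
import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Illustrative only: a pluggable gateway mirroring the six operations listed above;
// the default implementation would delegate to the existing SparkContext code paths.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String /* ... */): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String /* ... */): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
  def persist[T](rdd: RDD[T], level: StorageLevel): RDD[T]
  def unpersist[T](rdd: RDD[T], blocking: Boolean): RDD[T]
}
{code}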



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3657) yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187070#comment-14187070
 ] 

Apache Spark commented on SPARK-3657:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2981

> yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl 
> starting spark-shell
> ---
>
> Key: SPARK-3657
> URL: https://issues.apache.org/jira/browse/SPARK-3657
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>Priority: Blocker
>
> YarnRMClientImpl.registerApplicationMaster can throw a null pointer exception 
> when setting the tracking URL if it's empty:
> appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority())
> I hit this just starting spark-shell without the tracking URL set.
> 14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager 
> at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132)
> at 
> org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102)
> at 
> org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55)
> at 
> org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120)
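
One possible guard, shown only to illustrate the failure mode (new URI("").getAuthority 
returns null and the protobuf setter rejects null):

{code}
import java.net.URI

// Illustrative guard: skip empty addresses entirely instead of passing null downstream.
def safeTrackingUrl(uiAddress: String): Option[String] =
  Option(uiAddress).filter(_.nonEmpty).flatMap(a => Option(new URI(a).getAuthority))

// usage sketch: safeTrackingUrl(uiAddress).foreach(appMasterRequest.setTrackingUrl)
{code}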



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4031) Read broadcast variables on use

2014-10-28 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-4031.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

> Read broadcast variables on use
> ---
>
> Key: SPARK-4031
> URL: https://issues.apache.org/jira/browse/SPARK-4031
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 1.2.0
>
>
> This is a proposal to change the broadcast variable implementations in Spark 
> to only read values when they are used rather than on deserializing.
> This change will be very helpful (and in our use cases required) for complex 
> applications which have a large number of broadcast variables. For example if 
> broadcast variables are class members, they are captured in closures even 
> when they are not used.
> We could also consider cleaning closures more aggressively, but that might be 
> a more complex change.
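
For illustration, the core of the idea is to defer the fetch until first access, e.g. 
with a transient lazy value (a sketch only; the real broadcast implementations would 
read blocks from the block manager here):

{code}
// Illustrative only: a wrapper that defers loading a value until it is first used.
// After deserialization the @transient lazy val is recomputed on first access.
class LazyValue[T](load: () => T) extends Serializable {
  @transient private lazy val cached: T = load()
  def value: T = cached
}
{code}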



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4031) Read broadcast variables on use

2014-10-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187122#comment-14187122
 ] 

Shivaram Venkataraman commented on SPARK-4031:
--

Issue resolved by pull request 2871
https://github.com/apache/spark/pull/2871

> Read broadcast variables on use
> ---
>
> Key: SPARK-4031
> URL: https://issues.apache.org/jira/browse/SPARK-4031
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 1.2.0
>
>
> This is a proposal to change the broadcast variable implementations in Spark 
> to only read values when they are used rather than on deserializing.
> This change will be very helpful (and in our use cases required) for complex 
> applications which have a large number of broadcast variables. For example if 
> broadcast variables are class members, they are captured in closures even 
> when they are not used.
> We could also consider cleaning closures more aggressively, but that might be 
> a more complex change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4118) Create python bindings for Streaming KMeans

2014-10-28 Thread Anant Daksh Asthana (JIRA)
Anant Daksh Asthana created SPARK-4118:
--

 Summary: Create python bindings for Streaming KMeans
 Key: SPARK-4118
 URL: https://issues.apache.org/jira/browse/SPARK-4118
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Reporter: Anant Daksh Asthana
Priority: Minor


Create Python bindings for Streaming K-means
This is in reference to https://issues.apache.org/jira/browse/SPARK-3254
which adds Streaming K-means functionality to MLLib.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files

2014-10-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-4119:
-

 Summary: Don't rely on HIVE_DEV_HOME to find .q files
 Key: SPARK-4119
 URL: https://issues.apache.org/jira/browse/SPARK-4119
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.1.1
Reporter: Cheng Lian
Priority: Minor


After merging in Hive 0.13.1 support, a bunch of .q files and golden answer 
files got updated. Unfortunately, some .q files were updated in Hive. For example, 
an ORDER BY clause was added to groupby1_limit.q as a bug fix.

With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with false 
test failures, because .q files are looked up from HIVE_DEV_HOME and the outdated 
.q files are used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-10-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187183#comment-14187183
 ] 

Reynold Xin commented on SPARK-2468:


Scheduled to go in in 1.2.

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffer in the JVM that increases GC
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4120) Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL

2014-10-28 Thread Ravindra Pesala (JIRA)
Ravindra Pesala created SPARK-4120:
--

 Summary: Join of multiple tables with syntax like SELECT .. FROM 
T1,T2,T3.. does not work in SparkSQL
 Key: SPARK-4120
 URL: https://issues.apache.org/jira/browse/SPARK-4120
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Ravindra Pesala
 Fix For: 1.2.0


Queries with more than two tables do not work. 
{code}
sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key 
and a.key=c.key")
{code}

The above query gives the following exception.
{code}
Exception in thread "main" java.lang.RuntimeException: [1.40] failure: 
``UNION'' expected but `,' found

SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and 
a.key=c.key
   ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33)
at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75)

{code}
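
Until the comma-separated FROM list is handled, the same query can be expressed with 
explicit JOIN syntax, which the parser does accept (workaround sketch):

{code}
// Workaround sketch: rewrite the comma-separated FROM list as explicit inner joins.
sql("""SELECT * FROM records1 a
       JOIN records2 b ON a.key = b.key
       JOIN records3 c ON a.key = c.key""")
{code}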



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3611) Show number of cores for each executor in application web UI

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3611:
-
Affects Version/s: 1.0.0

> Show number of cores for each executor in application web UI
> 
>
> Key: SPARK-3611
> URL: https://issues.apache.org/jira/browse/SPARK-3611
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This number is not always fully known, because e.g. in Mesos your executors 
> can scale up and down in # of CPUs, but it would be nice to show at least the 
> number of cores the machine has in that case, or the # of cores the executor 
> has been configured with if known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4110) Wrong comments about default settings in spark-daemon.sh

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4110.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Assignee: Kousuke Saruta
Target Version/s: 1.1.1, 1.2.0  (was: 1.2.0)

> Wrong comments about default settings in spark-daemon.sh
> 
>
> Key: SPARK-4110
> URL: https://issues.apache.org/jira/browse/SPARK-4110
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> In spark-daemon.sh, thare are following comments.
> {code}
> #   SPARK_CONF_DIR  Alternate conf dir. Default is ${SPARK_PREFIX}/conf.
> #   SPARK_LOG_DIR   Where log files are stored.  PWD by default.
> {code}
> But, I think the default value for SPARK_CONF_DIR is ${SPARK_HOME}/conf and 
> for SPARK_LOG_DIR is ${SPARK_HOME}/logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4107) Incorrect handling of Channel.read()'s return value may lead to data truncation

2014-10-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4107.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

> Incorrect handling of Channel.read()'s return value may lead to data 
> truncation
> ---
>
> Key: SPARK-4107
> URL: https://issues.apache.org/jira/browse/SPARK-4107
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.1, 1.2.0
>
>
> When using {{Channel.read()}}, we need to properly handle the return value 
> and account for the case where we've read fewer bytes than expected.  There 
> are a few places where we don't do this properly, which may lead to incorrect 
> data truncation in rare circumstances.  I've opened a PR to fix this.
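
For illustration, the usual pattern is to loop until the buffer is full (a sketch, not 
the PR itself):

{code}
import java.io.EOFException
import java.nio.ByteBuffer
import java.nio.channels.ReadableByteChannel

// Illustrative sketch: Channel.read() may return fewer bytes than requested,
// so keep reading until the buffer is full or the channel reaches end-of-stream.
def readFully(channel: ReadableByteChannel, buffer: ByteBuffer): Unit = {
  while (buffer.hasRemaining) {
    if (channel.read(buffer) == -1) {
      throw new EOFException("Reached end of stream before the buffer was filled")
    }
  }
}
{code}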



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4096) let ApplicationMaster accept executor memory argument in same format as JVM memory strings

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4096:
-
Affects Version/s: 1.1.0

> let ApplicationMaster accept executor memory argument in same format as JVM 
> memory strings
> --
>
> Key: SPARK-4096
> URL: https://issues.apache.org/jira/browse/SPARK-4096
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.2.0
>
>
> Here ApplicationMaster accepts the executor memory argument only in numeric 
> format; we should let it accept JVM-style memory strings as well.
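
For illustration, a small parser for JVM-style memory strings (a sketch; Spark already 
has a similar internal helper, Utils.memoryStringToMb):

{code}
// Illustrative sketch: parse JVM-style memory strings like "512m" or "2g" into MB.
// Assumes a bare number is already in MB, matching the existing numeric argument.
def memoryStringToMb(str: String): Int = {
  val lower = str.trim.toLowerCase
  if (lower.endsWith("k")) (lower.dropRight(1).toLong / 1024).toInt
  else if (lower.endsWith("m")) lower.dropRight(1).toInt
  else if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
  else if (lower.endsWith("t")) lower.dropRight(1).toInt * 1024 * 1024
  else lower.toInt
}
{code}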



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4096) let ApplicationMaster accept executor memory argument in same format as JVM memory strings

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4096.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Assignee: WangTaoTheTonic
Target Version/s: 1.2.0

> let ApplicationMaster accept executor memory argument in same format as JVM 
> memory strings
> --
>
> Key: SPARK-4096
> URL: https://issues.apache.org/jira/browse/SPARK-4096
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: WangTaoTheTonic
>Assignee: WangTaoTheTonic
>Priority: Minor
> Fix For: 1.2.0
>
>
> Here ApplicationMaster accepts the executor memory argument only in numeric 
> format; we should let it accept JVM-style memory strings as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3657) yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3657.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

> yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl 
> starting spark-shell
> ---
>
> Key: SPARK-3657
> URL: https://issues.apache.org/jira/browse/SPARK-3657
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 1.2.0
>
>
> YarnRMClientImpl.registerApplicationMaster can throw a null pointer exception 
> when setting the tracking URL if it's empty:
> appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority())
> I hit this just starting spark-shell without the tracking URL set.
> 14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager 
> at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710)
> at 
> org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132)
> at 
> org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102)
> at 
> org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55)
> at 
> org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4089) The version number of Spark in _config.yaml is wrong.

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4089.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

> The version number of Spark in _config.yaml is wrong.
> -
>
> Key: SPARK-4089
> URL: https://issues.apache.org/jira/browse/SPARK-4089
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.2.0
>
>
> The version number of Spark in docs/_config.yaml for master branch should be 
> 1.2.0 for now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4065) pyspark will not use ipython on Windows

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4065.

  Resolution: Fixed
   Fix Version/s: 1.2.0
  1.1.1
Target Version/s: 1.1.1, 1.2.0

> pyspark will not use ipython on Windows
> ---
>
> Key: SPARK-4065
> URL: https://issues.apache.org/jira/browse/SPARK-4065
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Michael Griffiths
>Assignee: Michael Griffiths
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> pyspark2.cmd will not launch ipython, even if the environment variables are 
> set. It doesn't check for the existence of ipython environment variables - in 
> all cases, it will just launch python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4065) pyspark will not use ipython on Windows

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-4065:
-
Assignee: Michael Griffiths

> pyspark will not use ipython on Windows
> ---
>
> Key: SPARK-4065
> URL: https://issues.apache.org/jira/browse/SPARK-4065
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Michael Griffiths
>Assignee: Michael Griffiths
>Priority: Minor
> Fix For: 1.1.1, 1.2.0
>
>
> pyspark2.cmd will not launch ipython, even if the environment variables are 
> set. It doesn't check for the existence of ipython environment variables - in 
> all cases, it will just launch python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4048) Enhance and extend hadoop-provided profile

2014-10-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187355#comment-14187355
 ] 

Apache Spark commented on SPARK-4048:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2982

> Enhance and extend hadoop-provided profile
> --
>
> Key: SPARK-4048
> URL: https://issues.apache.org/jira/browse/SPARK-4048
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>
> The hadoop-provided profile is used to not package Hadoop dependencies inside 
> the Spark assembly. It works, sort of, but it could use some enhancements. A 
> quick list:
> - It doesn't include all things that could be removed from the assembly
> - It doesn't work well when you're publishing artifacts based on it 
> (SPARK-3812 fixes this)
> - There are other dependencies that could use similar treatment: Hive, HBase 
> (for the examples), Flume, Parquet, maybe others I'm missing at the moment.
> - Unit tests, more specifically, those that use local-cluster mode, do not 
> work when the assembly is built with this profile enabled.
> - The scripts to launch Spark jobs do not add needed "provided" jars to the 
> classpath when this profile is enabled, leaving it for people to figure that 
> out for themselves.
> - The examples assembly duplicates a lot of things in the main assembly.
> Part of this task is selfish since we build internally with this profile and 
> we'd like to make it easier for us to merge changes without having to keep 
> too many patches on top of upstream. But those feel like good improvements to 
> me, regardless.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4058) Log file name is hard coded even though there is a variable '$LOG_FILE '

2014-10-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4058.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Kousuke Saruta

> Log file name is hard coded even though there is a variable '$LOG_FILE '
> 
>
> Key: SPARK-4058
> URL: https://issues.apache.org/jira/browse/SPARK-4058
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 1.2.0
>
>
> In the script 'python/run-tests', the log file name is represented by a variable 
> 'LOG_FILE' and it is used in run-tests. But there are some hard-coded log 
> file names in the script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3814) Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL

2014-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3814.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2961
[https://github.com/apache/spark/pull/2961]

> Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL
> --
>
> Key: SPARK-3814
> URL: https://issues.apache.org/jira/browse/SPARK-3814
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yana Kadiyska
>Assignee: Ravindra Pesala
>Priority: Minor
> Fix For: 1.2.0
>
>
> Error: java.lang.RuntimeException: 
> Unsupported language features in query: select (case when bit_field & 1=1 
> then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' 
> LIMIT 2
> TOK_QUERY
>   TOK_FROM
> TOK_TABREF
>   TOK_TABNAME
>mytable 
>   TOK_INSERT
> TOK_DESTINATION
>   TOK_DIR
> TOK_TMP_FILE
> TOK_SELECT
>   TOK_SELEXPR
> TOK_FUNCTION
>   when
>   =
> &
>   TOK_TABLE_OR_COL
> bit_field
>   1
> 1
>   -
> TOK_TABLE_OR_COL
>   r_end
> TOK_TABLE_OR_COL
>   r_start
>   TOK_NULL
> TOK_WHERE
>   =
> TOK_TABLE_OR_COL
>   pkey
> '0178-2014-07'
> TOK_LIMIT
>   2
> SQLState:  null
> ErrorCode: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3988) Public API for DateType support

2014-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3988.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2901
[https://github.com/apache/spark/pull/2901]

> Public API for DateType support
> ---
>
> Key: SPARK-3988
> URL: https://issues.apache.org/jira/browse/SPARK-3988
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Minor
> Fix For: 1.2.0
>
>
> add Python API and something else.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4121) Master build failures after shading commons-math3

2014-10-28 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-4121:


 Summary: Master build failures after shading commons-math3
 Key: SPARK-4121
 URL: https://issues.apache.org/jira/browse/SPARK-4121
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Xiangrui Meng
Priority: Blocker


The Spark master Maven build kept failing after we replaced colt with 
commons-math3 and shaded the latter:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/

The error message is:

{code}
KMeansClusterSuite:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark assembly has been built with Hive, including Datanucleus jars on classpath
- task size should be small in both training and prediction *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 
9, localhost): java.io.InvalidClassException: 
org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
classdesc serialVersionUID = -795011761847245121, local class serialVersionUID 
= 424924496318419
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

This test passed with a local sbt build, so the failure should be caused by shading. 
Maybe there are two versions of commons-math3 (Hadoop depends on it), or MLlib 
doesn't use the shaded version at compile time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4121) Master build failures after shading commons-math3

2014-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4121:
-
Component/s: MLlib
 Build

> Master build failures after shading commons-math3
> -
>
> Key: SPARK-4121
> URL: https://issues.apache.org/jira/browse/SPARK-4121
> Project: Spark
>  Issue Type: Bug
>  Components: Build, MLlib, Spark Core
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> The Spark master Maven build kept failing after we replaced colt with 
> commons-math3 and shaded the latter:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/
> The error message is:
> {code}
> KMeansClusterSuite:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> - task size should be small in both training and prediction *** FAILED ***
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 
> (TID 9, localhost): java.io.InvalidClassException: 
> org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
> classdesc serialVersionUID = -795011761847245121, local class 
> serialVersionUID = 424924496318419
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> This test passed with a local sbt build, so the failure should be caused by shading. 
> Maybe there are two versions of commons-math3 (Hadoop depends on it), or 
> MLlib doesn't use the shaded version at compile time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4121) Master build failures after shading commons-math3

2014-10-28 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4121:
-
Description: 
The Spark master Maven build kept failing after we replace colt with 
commons-math3 and shade the latter:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/

The error message is:

{code}
KMeansClusterSuite:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark assembly has been built with Hive, including Datanucleus jars on classpath
- task size should be small in both training and prediction *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 
9, localhost): java.io.InvalidClassException: 
org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
classdesc serialVersionUID = -795011761847245121, local class serialVersionUID 
= 424924496318419
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
org.apache.spark.scheduler.Task.run(Task.scala:56)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

This test passed in a local sbt build, so the issue should be caused by shading. 
Maybe there are two versions of commons-math3 (Hadoop depends on it), or MLlib 
doesn't use the shaded version at compile time.

[~srowen] Could you take a look? Thanks!

  was:
The Spark master Maven build kept failing after we replace colt with 
commons-math3 and shade the later:

https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/

The error message is:

{code}
KMeansClusterSuite:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark assembly has been built with Hive, including Datanucleus jars on classpath
- task size should be small in both training and prediction *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 
9, localhost): java.io.InvalidClassException: 
org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
classdesc serialVersionUID = -795011761847245121, local class serialVersionUID 
= 424924496318419
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStre

[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3

2014-10-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187453#comment-14187453
 ] 

Patrick Wendell commented on SPARK-4121:


[~srowen] - can you help with this? This is likely happening because the 
PoissonSampler on the driver is using the classpath from Maven (with the 
unmodified version of PoissonSampler) and the executors are using the version 
from the assembly jar, which has package relocations of the commons math 
dependency in the byte code. This is a test that uses "local-cluster" mode.

Is there a reason we are doing these relocations in the assembly only? Would it 
be better to actually shade-and-inline commons-math in both the spark-core and 
spark-mllib package jars?

I'm guessing that discrepancies between the assembly and package jars could 
lead to problems beyond just this test failure. It also means that applications 
which compile against Spark's dependencies, rather than running through the 
Spark assembly packages, won't get the benefit of the shading we've done.
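
As an illustration of the serialization aspect only (the real fix lies in how 
the shading is applied, as discussed above): a Serializable class can pin an 
explicit serialVersionUID so that the generated UID no longer differs between 
two differently compiled copies of the class. A minimal hedged sketch, using a 
purely hypothetical class:

{code}
// Hypothetical sketch, not Spark code: pinning the UID means the stream
// descriptor written by the driver matches the one expected by executors,
// even if one side was compiled against a shaded commons-math3 type.
@SerialVersionUID(1L)
class ExampleSampler(fraction: Double) extends Serializable {
  // Stand-in for real sampling logic.
  def sample(x: Double): Boolean = x < fraction
}
{code}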

> Master build failures after shading commons-math3
> -
>
> Key: SPARK-4121
> URL: https://issues.apache.org/jira/browse/SPARK-4121
> Project: Spark
>  Issue Type: Bug
>  Components: Build, MLlib, Spark Core
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> The Spark master Maven build kept failing after we replace colt with 
> commons-math3 and shade the latter:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/
> The error message is:
> {code}
> KMeansClusterSuite:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> - task size should be small in both training and prediction *** FAILED ***
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 
> (TID 9, localhost): java.io.InvalidClassException: 
> org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
> classdesc serialVersionUID = -795011761847245121, local class 
> serialVersionUID = 424924496318419
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> This test passed in local sbt build. So the issue should be caused by 
> shading. Maybe there are two versions of commons-math3 (hadoop depends on 
> it), or MLlib doesn't use the shaded version at compile.
> [~srowen] Could you take a look? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3

2014-10-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187458#comment-14187458
 ] 

Josh Rosen commented on SPARK-4121:
---

Here's an easy way to reproduce this:

{code}
mvn -DskipTests package
mvn test -DwildcardSuites=org.apache.spark.mllib.clustering.KMeansClusterSuite
{code}

> Master build failures after shading commons-math3
> -
>
> Key: SPARK-4121
> URL: https://issues.apache.org/jira/browse/SPARK-4121
> Project: Spark
>  Issue Type: Bug
>  Components: Build, MLlib, Spark Core
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> The Spark master Maven build kept failing after we replace colt with 
> commons-math3 and shade the latter:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/
> The error message is:
> {code}
> KMeansClusterSuite:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> - task size should be small in both training and prediction *** FAILED ***
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 
> (TID 9, localhost): java.io.InvalidClassException: 
> org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
> classdesc serialVersionUID = -795011761847245121, local class 
> serialVersionUID = 424924496318419
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> This test passed in local sbt build. So the issue should be caused by 
> shading. Maybe there are two versions of commons-math3 (hadoop depends on 
> it), or MLlib doesn't use the shaded version at compile.
> [~srowen] Could you take a look? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4122) Add library to write data back to Kafka

2014-10-28 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-4122:
---

 Summary: Add library to write data back to Kafka
 Key: SPARK-4122
 URL: https://issues.apache.org/jira/browse/SPARK-4122
 Project: Spark
  Issue Type: Bug
Reporter: Hari Shreedharan
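
No description is attached yet. As a rough sketch of the kind of helper such a 
library might provide (the object, method, and parameter names below are 
invented, and the new Kafka producer API is assumed):

{code}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

// Hedged sketch, not an actual Spark API: write each partition of an
// RDD[String] to a Kafka topic, creating one producer per partition.
object KafkaWriterSketch {
  def writeToKafka(rdd: RDD[String], brokers: String, topic: String): Unit = {
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach { r =>
          producer.send(new ProducerRecord[String, String](topic, r))
        }
      } finally {
        producer.close()
      }
    }
  }
}
{code}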






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Michael Griffiths (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187471#comment-14187471
 ] 

Michael Griffiths commented on SPARK-3398:
--

I'm running into an issue with {{wait_for_cluster_state}} - specifically, 
waiting for {{ssh-ready}}.

AFAICT the [valid states in boto 
are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]:

* pending
* running
* shutting-down
* terminated
* stopping
* stopped

When I invoke spark_ec2.py, it never moves to the next stage (infinite loop).

Is {{ssh-ready}} a state in a different version of boto? 

Thanks,
Michael

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Michael Griffiths (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187471#comment-14187471
 ] 

Michael Griffiths edited comment on SPARK-3398 at 10/28/14 9:10 PM:


I'm running into an issue with {{wait_for_cluster_state}} - specifically, 
waiting for {{ssh-ready}}.

AFAICT the [valid states in boto 
are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]:

* pending
* running
* shutting-down
* terminated
* stopping
* stopped

When I invoke spark_ec2.py, it never moves to the next stage (infinite loop).

Is {{ssh-ready}} a state in a different version of boto? 

Thanks,
Michael


was (Author: michael.griffiths):
I'm running into an issue with {{wait_for_cluster_state}} - specifically, 
waiting {{for ssh-ready}}.

AFAICT the [valid states in boto 
are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]:

* pending
* running
* shutting-down
* terminated
* stopping
* stopped

When I invoke spark_ec2.py, it never moves to the next stage (infinite loop).

Is {{ssh-ready}} a state in a different version of boto? 

Thanks,
Michael

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3

2014-10-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187475#comment-14187475
 ] 

Sean Owen commented on SPARK-4121:
--

Yeah, I was seeing this locally but not on the Jenkins test build, so I chalked 
it up to weirdness in my build.
I think the answer may indeed be to do the relocation in core/mllib itself. 
I'll get on that.

> Master build failures after shading commons-math3
> -
>
> Key: SPARK-4121
> URL: https://issues.apache.org/jira/browse/SPARK-4121
> Project: Spark
>  Issue Type: Bug
>  Components: Build, MLlib, Spark Core
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>Priority: Blocker
>
> The Spark master Maven build kept failing after we replace colt with 
> commons-math3 and shade the latter:
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/
> The error message is:
> {code}
> KMeansClusterSuite:
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> - task size should be small in both training and prediction *** FAILED ***
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
> in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 
> (TID 9, localhost): java.io.InvalidClassException: 
> org.apache.spark.util.random.PoissonSampler; local class incompatible: stream 
> classdesc serialVersionUID = -795011761847245121, local class 
> serialVersionUID = 424924496318419
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
> 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> This test passed in local sbt build. So the issue should be caused by 
> shading. Maybe there are two versions of commons-math3 (hadoop depends on 
> it), or MLlib doesn't use the shaded version at compile.
> [~srowen] Could you take a look? Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3922) A global UTF8 constant for Spark

2014-10-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3922.

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Shixiong Zhu

> A global UTF8 constant for Spark
> 
>
> Key: SPARK-3922
> URL: https://issues.apache.org/jira/browse/SPARK-3922
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.2.0
>
>
> A global UTF8 constant is very helpful for handling encoding problems when 
> converting between String and bytes. There are several candidate solutions:
> 1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala
> 2. java.nio.charset.StandardCharsets.UTF_8 (requires JDK 7)
> 3. io.netty.util.CharsetUtil.UTF_8
> 4. com.google.common.base.Charsets.UTF_8
> 5. org.apache.commons.lang.CharEncoding.UTF_8
> 6. org.apache.commons.lang3.CharEncoding.UTF_8
> IMO, I prefer option 1) because people can find it easily.
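
A minimal sketch of option 1 above, assuming a Utils-style object (the object 
name here is illustrative):

{code}
import java.nio.charset.Charset

object UtilsSketch {
  // Single shared constant; callers pass it explicitly instead of relying on
  // the platform default charset.
  val UTF_8: Charset = Charset.forName("UTF-8")

  def main(args: Array[String]): Unit = {
    // Round-trip a string through bytes using the shared constant.
    val bytes = "héllo".getBytes(UTF_8)
    println(new String(bytes, UTF_8))
  }
}
{code}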



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3343) Support for CREATE TABLE AS SELECT that specifies the format

2014-10-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3343.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2570
[https://github.com/apache/spark/pull/2570]

> Support for CREATE TABLE AS SELECT that specifies the format
> 
>
> Key: SPARK-3343
> URL: https://issues.apache.org/jira/browse/SPARK-3343
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: HuQizhong
> Fix For: 1.2.0
>
>
> hql("""CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS
> TERMINATED BY ',' LINES TERMINATED BY '\n' as  SELECT SUM(uv) as uv,
> round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM
> tmp_adclick_sellplat """)
> 14/09/02 15:32:28 INFO ParseDriver: Parse Completed
> java.lang.RuntimeException:
> Unsupported language features in query: CREATE TABLE
> tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
> TERMINATED BY 'abc' as  SELECT SUM(uv) as uv, round(SUM(cost),2) as
> total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:255)
> at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75)
> at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib

2014-10-28 Thread Brian Gawalt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Gawalt updated SPARK-2335:

Labels: features  (was: features newbie)

> k-Nearest Neighbor classification and regression for MLLib
> --
>
> Key: SPARK-2335
> URL: https://issues.apache.org/jira/browse/SPARK-2335
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Brian Gawalt
>Priority: Minor
>  Labels: features
>
> The k-Nearest Neighbor model for classification and regression problems is a 
> simple and intuitive approach, offering a straightforward path to creating 
> non-linear decision/estimation contours. Its downsides, high variance 
> (sensitivity to the known training data set) and computational intensity when 
> estimating labels for new points, both play to Spark's big data strengths: 
> lots of data mitigates the variance concern; lots of workers mitigate the 
> computational latency.
> We should include kNN models as options in MLlib.
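
For concreteness, a hedged brute-force sketch of the kNN prediction described 
above (not an MLlib API; all names are illustrative). Each prediction scans the 
entire data set, which is exactly the computational cost the description 
mentions:

{code}
import org.apache.spark.rdd.RDD

// Brute-force k-nearest-neighbor classification of one query point against an
// RDD of (label, features) pairs.
object KnnSketch {
  private def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def classify(data: RDD[(Double, Array[Double])],
               query: Array[Double],
               k: Int): Double = {
    // Build (distance, label) pairs; takeOrdered keeps only the k closest.
    val neighbors = data
      .map { case (label, features) => (squaredDistance(features, query), label) }
      .takeOrdered(k)
    // Majority vote over the neighbor labels.
    neighbors.groupBy(_._2).maxBy(_._2.length)._1
  }
}
{code}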



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187519#comment-14187519
 ] 

Nicholas Chammas commented on SPARK-3398:
-

[~michael.griffiths] - 
[{{wait_for_cluster_state}}|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L634]
 will take any of the valid boto states, plus {{ssh-ready}}. {{ssh-ready}} is 
not a boto state, but rather a handy label for a relevant state that we want to 
wait for. {{spark-ec2}} manually checks for this state by testing SSH 
availability on each of the nodes in the cluster.

How are you invoking {{spark-ec2}}? Sometimes instances can take a few minutes 
before SSH becomes available. How long have you waited?
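
For illustration of what the {{ssh-ready}} check amounts to: poll every node 
until SSH is reachable. {{spark_ec2.py}} does this by shelling out to {{ssh}}; 
the sketch below only probes the TCP port, which is a rough approximation, and 
the names and timeouts are invented:

{code}
import java.net.{InetSocketAddress, Socket}

object SshReadySketch {
  // Returns true if a TCP connection to the given port succeeds within the
  // timeout; a rough stand-in for "SSH is accepting connections".
  def portOpen(host: String, port: Int = 22, timeoutMs: Int = 3000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: java.io.IOException => false
    } finally {
      socket.close()
    }
  }

  // Block until every host in the cluster answers on port 22.
  def waitForSshReady(hosts: Seq[String], pollMs: Long = 10000): Unit = {
    while (!hosts.forall(h => portOpen(h))) {
      Thread.sleep(pollMs)
    }
  }
}
{code}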

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187537#comment-14187537
 ] 

Michael Armbrust commented on SPARK-3683:
-

[~davies] there used to be some explicit code that checked for "NULL", but I 
can't find it anymore, so you are right that this problem might exist in Scala 
too. However, I can't reproduce it, as most SerDes seem to store null as "\N". 
Some sample code to reproduce the issue would be helpful.

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Michael Griffiths (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187543#comment-14187543
 ] 

Michael Griffiths commented on SPARK-3398:
--

I waited until all the servers (11) were up according to the AWS Console, then 
ran the command again with {{--resume}}. After that, I waited 10 minutes.

Then I went in, changed the check to "running", and it worked fine.

I'll check my setup (I'm invoking from an Ubuntu server); it's certainly 
possible that something is wrong there. 

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187553#comment-14187553
 ] 

Nicholas Chammas commented on SPARK-3398:
-

Hmm, I'm curious:
# Why did you have to run {{spark-ec2}} again with {{--resume}}?
# Are you using an AMI other than the standard one?
# If yes, do you know what shell that AMI defaults to? What does {{true ; echo 
$?}} return on that shell?

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-28 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187563#comment-14187563
 ] 

Davies Liu commented on SPARK-3683:
---

This commit removed the special case for "NULL":
{code}
commit cf989601d0e784e1c3507720e64636891fe28292
Author: Cheng Lian 
Date:   Fri May 30 22:13:11 2014 -0700

[SPARK-1959] String "NULL" shouldn't be interpreted as null value

JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959)

Author: Cheng Lian 

Closes #909 from liancheng/spark-1959 and squashes the following commits:

306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as 
null value

diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala
index f141139..d263c31 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala
@@ -113,7 +113,6 @@ case class HiveTableScan(
   }

   private def unwrapHiveData(value: Any) = value match {
-case maybeNull: String if maybeNull.toLowerCase == "null" => null
 case varchar: HiveVarchar => varchar.getValue
 case decimal: HiveDecimal => BigDecimal(decimal.bigDecimalValue)
 case other => other
{code}

So this should be a bug on the Hive side.

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Michael Griffiths (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187566#comment-14187566
 ] 

Michael Griffiths commented on SPARK-3398:
--

In order:

 # I tried a few times; it kept failing. Ultimately I ran it once to set up the 
instances, and then waited to ensure I could SSH into them manually before 
running again.

# No, I'm using the default AMI. The only parameters I'm passing are the SSH 
keyname, the key file, and cluster name.

# {{true ; echo $?}} returns 0.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None

2014-10-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187570#comment-14187570
 ] 

Michael Armbrust commented on SPARK-3683:
-

Good find! [~jamborta], if you can provide code showing that Hive does 
interpret this as null, I'd consider adding it back for compatibility; 
otherwise this seems like expected behavior. /cc [~liancheng]
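
For reference, a minimal sketch of the kind of repro that would settle this, 
assuming the standard Hive {{src}} test table is available (the table and 
column names below are illustrative, not from the original report):

{code}
// Store the literal string 'NULL' in a string column, then check how it reads back.
hql("CREATE TABLE IF NOT EXISTS null_repro (s STRING)")
hql("INSERT OVERWRITE TABLE null_repro SELECT 'NULL' FROM src LIMIT 1")

// If Hive treated the stored text 'NULL' as a real null, `s IS NULL` would be
// true here; a plain string result means the current behavior is expected.
hql("SELECT s, s IS NULL FROM null_repro").collect().foreach(println)
{code}
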

> PySpark Hive query generates "NULL" instead of None
> ---
>
> Key: SPARK-3683
> URL: https://issues.apache.org/jira/browse/SPARK-3683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Tamas Jambor
>Assignee: Davies Liu
>
> When I run a Hive query in Spark SQL, I get the new Row object, where it does 
> not convert Hive NULL into Python None instead it keeps it string 'NULL'. 
> It's only an issue with String type, works with other types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4123) Show new dependencies added in pull requests

2014-10-28 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-4123:
--

 Summary: Show new dependencies added in pull requests
 Key: SPARK-4123
 URL: https://issues.apache.org/jira/browse/SPARK-4123
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Priority: Critical


We should inspect the classpath of Spark's assembly jar for every pull request. 
This only takes a few seconds in Maven and it will help weed out dependency 
changes from the master branch. Ideally we'd post any dependency changes in the 
pull request message.

{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4123) Show new dependencies added in pull requests

2014-10-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4123:
---
Description: 
We should inspect the classpath of Spark's assembly jar for every pull request. 
This only takes a few seconds in Maven and it will help weed out dependency 
changes from the master branch. Ideally we'd post any dependency changes in the 
pull request message.

{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
$ git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}

  was:
We should inspect the classpath of Spark's assembly jar for every pull request. 
This only takes a few seconds in Maven and it will help weed out dependency 
changes from the master branch. Ideally we'd post any dependency changes in the 
pull request message.

{code}
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
git checkout apache/master
$ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
$ diff my-classpath master-classpath
< chill-java-0.3.6.jar
< chill_2.10-0.3.6.jar
---
> chill-java-0.5.0.jar
> chill_2.10-0.5.0.jar
{code}


> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4084) Reuse sort key in Sorter

2014-10-28 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-4084.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

> Reuse sort key in Sorter
> 
>
> Key: SPARK-4084
> URL: https://issues.apache.org/jira/browse/SPARK-4084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.2.0
>
>
> Sorter uses generic-typed key for sorting. When data is large, it creates 
> lots of key objects, which is not efficient. We should reuse the key in 
> Sorter for memory efficiency. This change is part of the petabyte sort 
> implementation from [~rxin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests

2014-10-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187590#comment-14187590
 ] 

Patrick Wendell commented on SPARK-4123:


[~nchammas] - do you have any interest in doing this one? 

> Show new dependencies added in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states

2014-10-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187593#comment-14187593
 ] 

Nicholas Chammas commented on SPARK-3398:
-

OK, so you're invoking {{spark-ec2}} from an Ubuntu server. I wonder whether 
that matters, specifically when we make [this 
call|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L615].

What happens if you replace the code at that line with this version?

{code}
ret = subprocess.check_call(
ssh_command(opts) + ['-t', '-t', '-o', 'ConnectTimeout=3',
 '%s@%s' % (opts.user, host), 
stringify_command('true')]
)
{code}

This will just print SSH's output to the screen instead of suppressing it. If 
anything's going wrong, it should be more obvious that way.

> Have spark-ec2 intelligently wait for specific cluster states
> -
>
> Key: SPARK-3398
> URL: https://issues.apache.org/jira/browse/SPARK-3398
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 1.2.0
>
>
> {{spark-ec2}} currently has retry logic for when it tries to install stuff on 
> a cluster and for when it tries to destroy security groups. 
> It would be better to have some logic that allows {{spark-ec2}} to explicitly 
> wait for when all the nodes in a cluster it is working on have reached a 
> specific state.
> Examples:
> * Wait for all nodes to be up
> * Wait for all nodes to be up and accepting SSH connections (then start 
> installing stuff)
> * Wait for all nodes to be down
> * Wait for all nodes to be terminated (then delete the security groups)
> Having a function in the {{spark_ec2.py}} script that blocks until the 
> desired cluster state is reached would reduce the need for various retry 
> logic. It would probably also eliminate the need for the {{--wait}} parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


