[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187768#comment-14187768 ] Cheng Lian commented on SPARK-3683: --- Actually, a Hive session was illustrated in SPARK-1959, and it seems that Hive interprets {{"NULL"}} as a literal string whose content is "NULL" rather than a null value. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL > Affects Versions: 1.1.0 > Reporter: Tamas Jambor > Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it > does not convert Hive NULL into Python None; instead it keeps the string > 'NULL'. > It's only an issue with the String type; it works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
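An illustrative sketch (not Spark's actual conversion path) of the behavior the reporter expects: a Hive NULL in a string column should surface as Python None, not as the literal string 'NULL'. The function name and the explicit null flag are hypothetical; real Hive object inspectors carry null-ness separately from the value, which is why the string "NULL" alone is ambiguous.

```python
def hive_value_to_python(value, is_null):
    """Convert a raw Hive cell to a Python value.

    `is_null` stands in for the null flag Hive's object inspectors carry;
    without it, the string 'NULL' is ambiguous (it may be real data).
    """
    if is_null:
        return None
    return value

# Three cells: real data, a true NULL, and a string that happens to be "NULL".
row = [hive_value_to_python(v, null)
       for v, null in [("abc", False), ("NULL", True), ("NULL", False)]]
```

Only the second cell becomes None; the third stays as the genuine string "NULL", which is the distinction the bug loses.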
[jira] [Commented] (SPARK-2215) Multi-way join
[ https://issues.apache.org/jira/browse/SPARK-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187798#comment-14187798 ] Reynold Xin commented on SPARK-2215: I think a simplified version of the multi-way join would make sense, i.e. one that does a multi-way inner-equi-broadcast join. > Multi-way join > -- > > Key: SPARK-2215 > URL: https://issues.apache.org/jira/browse/SPARK-2215 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Cheng Hao > Priority: Minor > > Support the multi-way join (multiple table joins) in a single reduce stage if > they have the same join keys.
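A toy sketch of the simplified variant Reynold suggests: a multi-way inner equi-join on one shared key, where each small table is "broadcast" as a hash map and a single pass over the large table probes all of them. This is an illustration of the idea only (assuming unique keys per broadcast side); the table contents and function name are made up.

```python
def multiway_broadcast_join(big, *smalls):
    # Build one hash table per broadcast side, keyed on the join key.
    maps = [{k: v for k, v in s} for s in smalls]
    out = []
    for k, v in big:
        # Inner join semantics: emit only if the key exists in every side.
        if all(k in m for m in maps):
            out.append((k, v, *(m[k] for m in maps)))
    return out

t1 = [(1, "a"), (2, "b"), (3, "c")]   # the large side, streamed once
t2 = [(1, "x"), (3, "y")]             # broadcast side
t3 = [(1, "p"), (2, "q"), (3, "r")]   # broadcast side
result = multiway_broadcast_join(t1, t2, t3)
```

Because every side joins on the same key, all tables are matched in one pass rather than in a chain of pairwise join stages.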
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187832#comment-14187832 ] zzc commented on SPARK-2468: Hi Reynold Xin, approximately when will version 1.2 be released? > Netty-based block server / client module > > > Key: SPARK-2468 > URL: https://issues.apache.org/jira/browse/SPARK-2468 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core > Reporter: Reynold Xin > Assignee: Reynold Xin > Priority: Critical > > Right now shuffle send goes through the block manager. This is inefficient > because it requires loading a block from disk into a kernel buffer, then into > a user space buffer, and then back to a kernel send buffer before it reaches > the NIC. It makes multiple copies of the data and context switches between > kernel and user space. It also creates unnecessary buffers in the JVM that > increase GC pressure. > Instead, we should use FileChannel.transferTo, which handles this in > kernel space with zero-copy. See > http://www.ibm.com/developerworks/library/j-zerocopy/ > One potential solution is to use Netty. Spark already has a Netty-based > network module implemented (org.apache.spark.network.netty). However, it > lacks some functionality and is turned off by default.
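The zero-copy idea the issue describes, sketched in Python terms rather than the JVM's FileChannel.transferTo: socket.sendfile (os.sendfile under the hood on Linux) hands the open file descriptor to the kernel, which moves the bytes to the socket without routing them through a user-space buffer. The socketpair stands in for a shuffle server/client pair; the payload is made up.

```python
import socket
import tempfile

payload = b"shuffle-block-bytes" * 100

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.seek(0)
    # A connected pair of sockets stands in for the shuffle server/client.
    server, client = socket.socketpair()
    with server, client:
        # Kernel-space transfer where supported; no user-space copy of the data.
        sent = server.sendfile(f)
        server.shutdown(socket.SHUT_WR)
        received = b"".join(iter(lambda: client.recv(65536), b""))
```

This avoids exactly the disk → kernel buffer → user buffer → kernel send buffer round trip the issue complains about; the JVM equivalent is FileChannel.transferTo as the description says.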
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187837#comment-14187837 ] Reynold Xin commented on SPARK-2468: Take a look here https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187861#comment-14187861 ] Nicholas Chammas commented on SPARK-3398: - So I spun up an Ubuntu server on EC2 and was able to reproduce this issue. For some reason, the call to SSH in the [referenced line|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L615] fails because it can't find the {{pem}} file passed in to {{spark-ec2}}. Strange. I'm looking into why. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 > Reporter: Nicholas Chammas > Assignee: Nicholas Chammas > Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter.
[jira] [Updated] (SPARK-3466) Limit size of results that a driver collects for each action
[ https://issues.apache.org/jira/browse/SPARK-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3466: - Priority: Critical (was: Major) > Limit size of results that a driver collects for each action > > > Key: SPARK-3466 > URL: https://issues.apache.org/jira/browse/SPARK-3466 > Project: Spark > Issue Type: New Feature > Components: Spark Core > Reporter: Matei Zaharia > Assignee: Davies Liu > Priority: Critical > > Right now, operations like {{collect()}} and {{take()}} can crash the driver > with an OOM if they bring back too much data. We should add a > {{spark.driver.maxResultSize}} setting (or something like that) that will > make the driver abort a job if its result is too big. We can set it to some > fraction of the driver's memory by default, or to something like 100 MB.
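A minimal sketch of the proposed guard: abort once the accumulated result size passes a configured cap, instead of letting the driver run out of memory. The 100 MB figure and the setting's spirit come from the issue description; the helper function and its name are hypothetical, not Spark's implementation.

```python
MAX_RESULT_SIZE = 100 * 1024 * 1024  # spark.driver.maxResultSize-style default cap

def collect_with_limit(partition_results, max_bytes=MAX_RESULT_SIZE):
    """Accumulate per-partition result chunks, aborting past max_bytes."""
    total = 0
    collected = []
    for chunk in partition_results:
        total += len(chunk)
        if total > max_bytes:
            # Abort the job with a clear error instead of OOMing the driver.
            raise RuntimeError(
                f"Result of {total} bytes exceeds limit of {max_bytes} bytes")
        collected.append(chunk)
    return collected

ok = collect_with_limit([b"x" * 10, b"y" * 20], max_bytes=100)
```

The point of the design is that the failure is an explicit, catchable job abort with a size in the message, rather than an opaque driver OOM.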
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187881#comment-14187881 ] zzc commented on SPARK-2468: thanks
[jira] [Created] (SPARK-4126) Do not set `spark.executor.instances` if not needed (yarn-cluster)
Andrew Or created SPARK-4126: Summary: Do not set `spark.executor.instances` if not needed (yarn-cluster) Key: SPARK-4126 URL: https://issues.apache.org/jira/browse/SPARK-4126 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor In yarn cluster mode, we currently always set `spark.executor.instances` regardless of whether this is set by the user. While not a huge deal, this prevents us from knowing whether the user did specify a starting number of executors. This is needed in SPARK-3795 to throw the appropriate exception when this is set AND dynamic executor allocation is turned on.
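The logic of the proposed fix, in miniature: keep the "unset" state observable by only reading a default instead of writing it back into the conf. The dict-based conf and function name are illustrative stand-ins, not Spark's actual code.

```python
def build_conf(user_conf, default_instances=2):
    """Resolve executor count without clobbering the user's conf.

    Before the fix: spark.executor.instances was always written into the
    conf, erasing whether the user had set it. After: the key stays absent
    unless the user set it, so checks like SPARK-3795's (explicit count AND
    dynamic allocation) can tell the two cases apart.
    """
    conf = dict(user_conf)
    explicit = "spark.executor.instances" in conf
    instances = int(conf.get("spark.executor.instances", default_instances))
    return conf, instances, explicit

conf, n, explicit = build_conf({})
```

With an empty user conf, the resolved count falls back to the default while the conf itself records that the user never set it.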
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187898#comment-14187898 ] Nicholas Chammas commented on SPARK-3398: - I think I've found the issue. It doesn't have anything to do with Ubuntu or with {{wait_for_cluster_state}}. [~michael.griffiths] - Did {{spark-ec2 launch --resume}} and {{spark-ec2 login}} ultimately work for you to the point where you had a working Spark EC2 cluster? Or are you not sure if in the end you were able to get a working cluster? What I'm seeing is that the issue is specifying the path to the SSH Identity file relative to the current working directory vs. absolutely. Do you still see the same issue if you specify the path to the Identity file absolutely? That is: {code} # Currently not working spark-ec2 -i ../my.pem {code} {code} # Should work spark-ec2 -i ~/my.pem spark-ec2 -i /home/me/my.pem {code}
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187900#comment-14187900 ] Nicholas Chammas commented on SPARK-3398: - If that fixes it for you, then I think the solution is simple. We just need to set {{cwd}} to the user's current working directory in all our calls to [{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call]. Right now it defaults to the {{spark-ec2}} directory, which will be problematic if you call {{spark-ec2}} from another directory.
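The pitfall the comments describe, sketched: a relative path like ../my.pem resolves against whatever working directory the subprocess runs in, not the shell the user typed it in. One robust fix (an illustration; the helper name is made up, not spark-ec2's code) is to anchor the path at the invocation directory before any chdir or cwd= takes effect.

```python
import os

def resolve_identity_file(path, invocation_cwd):
    """Resolve an SSH identity file path against the user's original cwd.

    Expand ~ and anchor relative paths at the directory the user ran the
    tool from, before any subprocess call with a different cwd= sees it.
    """
    path = os.path.expanduser(path)
    if not os.path.isabs(path):
        path = os.path.normpath(os.path.join(invocation_cwd, path))
    return path

# The failing case from the comment above: ../my.pem typed from /home/me/work.
resolved = resolve_identity_file("../my.pem", "/home/me/work")
```

Once resolved to an absolute path, the pem file is found regardless of which directory the subprocess later runs in, which matches the observation that absolute `-i` paths already work.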
[jira] [Resolved] (SPARK-3904) HQL doesn't support the ConstantObjectInspector to pass into UDFs
[ https://issues.apache.org/jira/browse/SPARK-3904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3904. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2762 [https://github.com/apache/spark/pull/2762] > HQL doesn't support the ConstantObjectInspector to pass into UDFs > - > > Key: SPARK-3904 > URL: https://issues.apache.org/jira/browse/SPARK-3904 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Cheng Hao > Assignee: Cheng Hao > Fix For: 1.2.0 > > > In HQL, we convert all of the data types into normal ObjectInspectors for > UDFs. In most cases this works; however, some UDFs actually require the > input ObjectInspector to be a ConstantObjectInspector, which causes an > exception. > e.g. > {panel} > select named_struct("x", "str") from src limit 1 > {panel} > It throws an exception like > {panel} > 14/10/10 16:25:17 INFO parse.ParseDriver: Parsing command: select > named_struct("x", "str") from src > 14/10/10 16:25:17 INFO parse.ParseDriver: Parse Completed > 14/10/10 16:25:17 INFO metastore.HiveMetaStore: 0: get_table : db=default > tbl=src > 14/10/10 16:25:17 INFO HiveMetaStore.audit: ugi=hcheng > ip=unknown-ip-addr cmd=get_table : db=default tbl=tmp2 > 14/10/10 16:25:17 ERROR thriftserver.SparkSQLDriver: Failed in [select > named_struct("x", "str") from src] > org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: Even arguments to > NAMED_STRUCT must be a constant > STRING.org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaStringObjectInspector@2f2dbcfc > at > org.apache.hadoop.hive.ql.udf.generic.GenericUDFNamedStruct.initialize(GenericUDFNamedStruct.java:55) > at > org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:129) > at > org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:129) > at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:158) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$6$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:267) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$6$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:260) > {panel}
[jira] [Created] (SPARK-4127) Streaming Linear Regression
Anant Daksh Asthana created SPARK-4127: -- Summary: Streaming Linear Regression Key: SPARK-4127 URL: https://issues.apache.org/jira/browse/SPARK-4127 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor Create Python bindings for Streaming Linear Regression (MLlib).
[jira] [Commented] (SPARK-4127) Streaming Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187906#comment-14187906 ] Anant Daksh Asthana commented on SPARK-4127: [~mengxr] [~freeman-lab] Just added this issue. Could you please assign it to me? Thanks!
[jira] [Updated] (SPARK-4127) Streaming Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4127: --- Description: Create python bindings for Streaming Linear Regression (MLlib). The Mllib file relevant to this issue can be found (here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala] was: Create python bindings for Streaming Linear Regression (MLlib).
[jira] [Updated] (SPARK-4127) Streaming Linear Regression
[ https://issues.apache.org/jira/browse/SPARK-4127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4127: --- Description: Create python bindings for Streaming Linear Regression (MLlib). The Mllib file relevant to this issue can be found at: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala was: Create python bindings for Streaming Linear Regression (MLlib). The Mllib file relevant to this issue can be found (here)[https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala]
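What streaming linear regression does under the hood, in a bare-bones sketch: the model weights are refined with one gradient step per incoming mini-batch. This is a pure-Python illustration of the update rule only, not MLlib's implementation; the data stream, step size, and function name are made up.

```python
def sgd_step(w, batch, lr=0.1):
    """One least-squares gradient step over a mini-batch: w <- w - lr * dL/dw."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi / len(batch)
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# A stream of mini-batches drawn from y = 3x; the weight should approach 3.
w = [0.0]
for batch in [[((1.0,), 3.0), ((2.0,), 6.0)]] * 50:
    w = sgd_step(w, batch)
```

The streaming API wraps exactly this loop: each arriving batch of labeled points triggers one update, so the model tracks the data as it flows in.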
[jira] [Resolved] (SPARK-4113) Python UDF on ArrayType
[ https://issues.apache.org/jira/browse/SPARK-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4113. - Resolution: Fixed Fix Version/s: 1.2.0 > Python UDF on ArrayType > -- > > Key: SPARK-4113 > URL: https://issues.apache.org/jira/browse/SPARK-4113 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL > Affects Versions: 1.2.0 > Reporter: Davies Liu > Assignee: Davies Liu > Priority: Blocker > Fix For: 1.2.0 > > > from Matei: > I have a table where column c is of type array. However the following > set of commands fails: > sqlContext.registerFunction("py_func", lambda a: len(a)) > %sql select py_func(c) from some_temp > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 252.0 failed 4 times, most recent failure: Lost task 2.3 in stage 252.0 > (TID 8454, ip-10-0-157-104.us-west-2.compute.internal): > net.razorvine.pickle.PickleException: couldn't introspect javabean: > java.lang.IllegalArgumentException: wrong number of arguments > net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.dump(Pickler.java:95) > The same function works if I select a Row from my table into Python and call > it on its third column.
[jira] [Comment Edited] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187900#comment-14187900 ] Nicholas Chammas edited comment on SPARK-3398 at 10/29/14 2:48 AM: --- If that fixes it for you, then I think the solution is simple. -We just need to set {{cwd}} to the user's current working directory in all our calls to [{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call]. Right now it defaults to the {{spark-ec2}} directory, which will be problematic if you call {{spark-ec2}} from another directory.- We need to fix [how the script gets called here|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark-ec2#L22]. was (Author: nchammas): If that fixes it for you, then I think the solution is simple. We just need to set {{cwd}} to the user's current working directory in all our calls to [{{subprocess.check_call()}}|https://docs.python.org/2/library/subprocess.html#subprocess.check_call]. Right now it defaults to the {{spark-ec2}} directory, which will be problematic if you call {{spark-ec2}} from another directory.
[jira] [Commented] (SPARK-4120) Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL
[ https://issues.apache.org/jira/browse/SPARK-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187929#comment-14187929 ] Apache Spark commented on SPARK-4120: - User 'ravipesala' has created a pull request for this issue: https://github.com/apache/spark/pull/2987 > Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not > work in SparkSQL > > > Key: SPARK-4120 > URL: https://issues.apache.org/jira/browse/SPARK-4120 > Project: Spark > Issue Type: Bug > Components: SQL > Reporter: Ravindra Pesala > Assignee: Ravindra Pesala > Fix For: 1.2.0 > > > Queries joining more than two tables do not work. > {code} > sql("SELECT * FROM records1 as a,records2 as b,records3 as c where > a.key=b.key and a.key=c.key") > {code} > The above query gives the following exception. > {code} > Exception in thread "main" java.lang.RuntimeException: [1.40] failure: > ``UNION'' expected but `,' found > SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and > a.key=c.key > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) > at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) > {code}
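The comma-separated FROM syntax the parser rejects is standard SQL shorthand for a chain of inner joins, so until the parser fix lands, rewriting the query with explicit JOIN ... ON clauses is a workaround. The equivalence can be demonstrated with sqlite3 standing in for Spark SQL; the table contents here are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("records1", "records2", "records3"):
    conn.execute(f"CREATE TABLE {t} (key INTEGER, val TEXT)")
    conn.executemany(f"INSERT INTO {t} VALUES (?, ?)",
                     [(1, t + "-1"), (2, t + "-2")])

# The comma syntax from the bug report (works in standard SQL)...
comma = conn.execute(
    "SELECT a.key FROM records1 a, records2 b, records3 c "
    "WHERE a.key = b.key AND a.key = c.key ORDER BY a.key").fetchall()

# ...and the equivalent explicit-JOIN form, which Spark SQL's parser accepts.
explicit = conn.execute(
    "SELECT a.key FROM records1 a JOIN records2 b ON a.key = b.key "
    "JOIN records3 c ON a.key = c.key ORDER BY a.key").fetchall()
```

Both forms return identical rows; the bug is purely in the parser's handling of the comma form, not in the join semantics.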
[jira] [Resolved] (SPARK-4008) Fix "kryo with fold" in KryoSerializerSuite
[ https://issues.apache.org/jira/browse/SPARK-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-4008. - Resolution: Fixed Fix Version/s: 1.2.0 > Fix "kryo with fold" in KryoSerializerSuite > --- > > Key: SPARK-4008 > URL: https://issues.apache.org/jira/browse/SPARK-4008 > Project: Spark > Issue Type: Test > Components: Spark Core > Reporter: Shixiong Zhu > Priority: Minor > Labels: unit-test > Fix For: 1.2.0 > > > "kryo with fold" in KryoSerializerSuite is disabled now. It can be fixed by > changing the zeroValue
[jira] [Closed] (SPARK-3008) PySpark fails due to zipimport not able to load the assembly jar (/usr/bin/python: No module named pyspark)
[ https://issues.apache.org/jira/browse/SPARK-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jai Kumar Singh closed SPARK-3008. -- > PySpark fails due to zipimport not able to load the assembly jar > (/usr/bin/python: No module named pyspark) > > > Key: SPARK-3008 > URL: https://issues.apache.org/jira/browse/SPARK-3008 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: Assembly jar > target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar > jar -tf > assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.2.0.jar | wc > -l > 70441 > git sha commit ba28a8fcbc3ba432e7ea4d6f0b535450a6ec96c6 > Reporter: Jai Kumar Singh > Labels: pyspark > > PySpark is not working. It fails because zipimport is not able to import the > assembly jar, because it contains more than 65536 files. > Email threads in this regard are below: > http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3ccamjob8kcgk0pqiogju6uokceyswcusw3xwd5wrs8ikpmgd2...@mail.gmail.com%3E > https://mail.python.org/pipermail/python-list/2014-May/671353.html > Is there any workaround to bypass the issue?
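The failure mode: zipimport (before it gained zip64 support) cannot read archives whose entry count exceeds the classic ZIP format's 16-bit limit, and the reported assembly jar has 70441 entries. Since a jar is just a zip, a quick pre-flight check with the zipfile module can flag such a jar before the import error appears. This is an illustrative sketch; the constant name and helper are made up.

```python
import io
import zipfile

ZIPIMPORT_MAX_ENTRIES = 65535  # 16-bit entry count in the classic ZIP format

def entry_count(zip_bytes):
    """Count the entries in a zip/jar given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return len(zf.infolist())

# Build a tiny in-memory "jar" to exercise the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for i in range(3):
        zf.writestr(f"pkg/mod{i}.py", "x = 1\n")

count = entry_count(buf.getvalue())
too_big = count > ZIPIMPORT_MAX_ENTRIES
```

Running the same check against the real assembly jar would report 70441 entries and flag it as over the limit, matching the `jar -tf ... | wc -l` output in the environment field.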
[jira] [Updated] (SPARK-1442) Add Window function support
[ https://issues.apache.org/jira/browse/SPARK-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guowei updated SPARK-1442: -- Attachment: Window Function.pdf > Add Window function support > --- > > Key: SPARK-1442 > URL: https://issues.apache.org/jira/browse/SPARK-1442 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Chengxiang Li > Attachments: Window Function.pdf > > > Similar to Hive, add window function support for Catalyst. > https://issues.apache.org/jira/browse/HIVE-4197 > https://issues.apache.org/jira/browse/HIVE-896
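What the proposal adds, in miniature: a window function computes a per-row value over a partition without collapsing the rows the way GROUP BY does. Here is the semantics of ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) in plain Python; the rows and helper name are made up for illustration.

```python
from itertools import groupby

rows = [("eng", "ann", 120), ("eng", "bob", 100), ("hr", "cat", 90)]

def row_number(rows, partition_key, order_key):
    """Append a 1-based rank within each partition, preserving every row."""
    out = []
    # Sort by partition first so groupby sees each partition contiguously,
    # then by the ordering key within the partition.
    keyed = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, part in groupby(keyed, key=partition_key):
        for i, r in enumerate(part, start=1):
            out.append(r + (i,))
    return out

# Rank employees by salary (descending) within each department.
ranked = row_number(rows, lambda r: r[0], lambda r: -r[2])
```

Note every input row survives with a rank attached, which is exactly what distinguishes window functions from aggregation.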
[jira] [Commented] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187982#comment-14187982 ] Zhang, Liye commented on SPARK-4094: [SPARK-3625|https://issues.apache.org/jira/browse/SPARK-3625] did something similar to this issue, but it currently does not support a case like this: *rdd0 = sc.makeRDD(...)* *rdd1 = rdd0.flatmap(...)* *rdd1.collect()* *rdd0.checkpoint()* *rdd1.count()* in which *rdd0* would not be checkpointed. In this JIRA, we will always traverse the whole RDD lineage on any RDD action, until encountering RDDs that have already been checkpointed. Since the traversal only checks the status of RDDs, the operation will not introduce much performance impact. > checkpoint should still be available after rdd actions > -- > > Key: SPARK-4094 > URL: https://issues.apache.org/jira/browse/SPARK-4094 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 1.1.0 > Reporter: Zhang, Liye > > rdd.checkpoint() must be called before any actions on this RDD; if there is > any other action before, the checkpoint will never succeed. Take the following > code as an example: > *rdd = sc.makeRDD(...)* > *rdd.collect()* > *rdd.checkpoint()* > *rdd.count()* > This RDD would never be checkpointed. But this does not happen for RDD cache: > caching always takes effect before RDD actions, no matter whether there were > any actions before cache(). > So rdd.checkpoint() should have the same behavior as rdd.cache().
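A toy model of the proposed behavior: an RDD can be marked for checkpointing at any time, and every action walks the lineage, materializing pending checkpoints and stopping at already-checkpointed ancestors. The class and method names are illustrative, not Spark's actual internals.

```python
class ToyRDD:
    def __init__(self, parent=None):
        self.parent = parent
        self.marked = False        # checkpoint() has been requested
        self.checkpointed = False  # data has actually been saved

    def checkpoint(self):
        self.marked = True

    def run_action(self):
        # Traverse the whole lineage on *every* action (not just the first),
        # stopping early at ancestors that are already checkpointed.
        node = self
        while node is not None and not node.checkpointed:
            if node.marked:
                node.checkpointed = True  # stand-in for writing to stable storage
            node = node.parent

rdd0 = ToyRDD()
rdd1 = ToyRDD(parent=rdd0)
rdd1.run_action()   # action runs before checkpoint() - nothing saved yet
rdd0.checkpoint()   # marked late, after an action has already happened
rdd1.run_action()   # a later action still picks up rdd0's pending checkpoint
```

This is the exact case from the comment: rdd0.checkpoint() is called after rdd1 has already run an action, yet the next action still checkpoints rdd0, because the traversal only inspects per-RDD status flags it stays cheap.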
[jira] [Created] (SPARK-4128) Create instructions on fully building Spark in IntelliJ
Patrick Wendell created SPARK-4128: -- Summary: Create instructions on fully building Spark in IntelliJ Key: SPARK-4128 URL: https://issues.apache.org/jira/browse/SPARK-4128 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Patrick Wendell Priority: Blocker With some of our more complicated modules, I'm not sure whether IntelliJ correctly understands all source locations. Also, we might require specifying some profiles for the build to work directly. We should document clearly how to start with vanilla Spark master and get the entire thing building in IntelliJ. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4062) Improve KafkaReceiver to prevent data loss
[ https://issues.apache.org/jira/browse/SPARK-4062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188008#comment-14188008 ] Apache Spark commented on SPARK-4062: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/2991 > Improve KafkaReceiver to prevent data loss > -- > > Key: SPARK-4062 > URL: https://issues.apache.org/jira/browse/SPARK-4062 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Saisai Shao > Attachments: RefactoredKafkaReceiver.pdf > > > Current KafkaReceiver has data loss and data re-consuming problem. Here we > propose a ReliableKafkaReceiver to improving its reliability and fault > tolerance with the power of Spark Streaming's WAL mechanism. > This is a follow up work of SPARK-3129. Design doc is posted, any comments > would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4113) Python UDF on ArrayType
[ https://issues.apache.org/jira/browse/SPARK-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188007#comment-14188007 ] Apache Spark commented on SPARK-4113: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2973 > Python UDF on ArrayType > -- > > Key: SPARK-4113 > URL: https://issues.apache.org/jira/browse/SPARK-4113 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.2.0 >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > Fix For: 1.2.0 > > > from Matei: > I have a table where column c is of type array. However the following > set of commands fails: > sqlContext.registerFunction("py_func", lambda a: len(a)) > %sql select py_func(c) from some_temp > Error in SQL statement: java.lang.RuntimeException: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 252.0 failed 4 times, most recent failure: Lost task 2.3 in stage 252.0 > (TID 8454, ip-10-0-157-104.us-west-2.compute.internal): > net.razorvine.pickle.PickleException: couldn't introspect javabean: > java.lang.IllegalArgumentException: wrong number of arguments > net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:392) > net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) > net.razorvine.pickle.Pickler.save(Pickler.java:125) > net.razorvine.pickle.Pickler.dump(Pickler.java:95) > The same function works if I select a Row from my table into Python and call > it on its third column. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets
[ https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188012#comment-14188012 ] Xiangrui Meng commented on SPARK-3080: -- Thanks for confirming the issue! I guess this could be a serialization issue. Did you observe any executor loss during the computation or in-memory cached RDDs switching to on-disk storage? [~derenrich] Which public dataset are you using? Could you also let me know all the ALS parameters and custom Spark settings you used? Thanks! [~ilganeli] If you do need to run ALS on the full dataset, I recommend using the new ALS implementation at https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala It should perform better. But it is not merged yet. > ArrayIndexOutOfBoundsException in ALS for Large datasets > > > Key: SPARK-3080 > URL: https://issues.apache.org/jira/browse/SPARK-3080 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Burak Yavuz > > The stack trace is below: > {quote} > java.lang.ArrayIndexOutOfBoundsException: 2716 > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543) > scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) > > org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505) > > org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > > org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) > scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > {quote} > This happened after the dataset was sub-sampled. > Dataset properties: ~12B ratings > Setup: 55 r3.8xlarge ec2 instances -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer
DB Tsai created SPARK-4129: -- Summary: Performance tuning in MultivariateOnlineSummarizer Key: SPARK-4129 URL: https://issues.apache.org/jira/browse/SPARK-4129 Project: Spark Issue Type: Improvement Components: MLlib Reporter: DB Tsai In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop through the nonZero elements in the vector. However, activeIterator doesn't perform well due to lots of overhead. In this PR, native while loop is used for both DenseVector and SparseVector. The benchmark result with 20 executors using mnist8m dataset: Before: DenseVector: 48.2 seconds SparseVector: 16.3 seconds After: DenseVector: 17.8 seconds SparseVector: 11.2 seconds Since MultivariateOnlineSummarizer is used in several places, the overall performance gain in mllib library will be significant with this PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4129) Performance tuning in MultivariateOnlineSummarizer
[ https://issues.apache.org/jira/browse/SPARK-4129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188037#comment-14188037 ] Apache Spark commented on SPARK-4129: - User 'dbtsai' has created a pull request for this issue: https://github.com/apache/spark/pull/2992 > Performance tuning in MultivariateOnlineSummarizer > -- > > Key: SPARK-4129 > URL: https://issues.apache.org/jira/browse/SPARK-4129 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: DB Tsai > > In MultivariateOnlineSummarizer, breeze's activeIterator is used to loop > through the nonZero elements in the vector. However, activeIterator doesn't > perform well due to lots of overhead. In this PR, native while loop is used > for both DenseVector and SparseVector. > The benchmark result with 20 executors using mnist8m dataset: > Before: > DenseVector: 48.2 seconds > SparseVector: 16.3 seconds > After: > DenseVector: 17.8 seconds > SparseVector: 11.2 seconds > Since MultivariateOnlineSummarizer is used in several places, the overall > performance gain in mllib library will be significant with this PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
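The shape of the change can be sketched in plain Python (invented names, not MLlib code): specialize the hot loop for dense and sparse vectors with an explicit while loop, instead of going through a generic iterator, which is where breeze's activeIterator overhead comes from.

```python
# Toy online summarizer illustrating the dense/sparse while-loop
# specialization. A real summarizer also tracks variance, min/max, etc.

class OnlineSummarizer:
    def __init__(self, dim):
        self.n = 0
        self.sums = [0.0] * dim

    def add_dense(self, values):
        i = 0
        while i < len(values):   # tight loop, no iterator allocation
            self.sums[i] += values[i]
            i += 1
        self.n += 1

    def add_sparse(self, indices, values):
        k = 0
        while k < len(indices):  # touch only the nonzero entries
            self.sums[indices[k]] += values[k]
            k += 1
        self.n += 1

    def mean(self):
        return [s / self.n for s in self.sums]

s = OnlineSummarizer(3)
s.add_dense([1.0, 2.0, 3.0])
s.add_sparse([0, 2], [3.0, 1.0])  # adds 3.0 at index 0 and 1.0 at index 2
print(s.mean())  # [2.0, 1.0, 2.0]
```

In Python the iterator overhead is less dramatic than on the JVM, but the structure mirrors the PR: one specialized loop per vector representation, with the sparse path skipping zero entries entirely.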
[jira] [Commented] (SPARK-4122) Add library to write data back to Kafka
[ https://issues.apache.org/jira/browse/SPARK-4122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188052#comment-14188052 ] Apache Spark commented on SPARK-4122: - User 'harishreedharan' has created a pull request for this issue: https://github.com/apache/spark/pull/2994 > Add library to write data back to Kafka > --- > > Key: SPARK-4122 > URL: https://issues.apache.org/jira/browse/SPARK-4122 > Project: Spark > Issue Type: Bug >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-4111: --- Summary: [MLlib] Implement regression model evaluation metrics (was: Implement regression model evaluation metrics) > [MLlib] Implement regression model evaluation metrics > - > > Key: SPARK-4111 > URL: https://issues.apache.org/jira/browse/SPARK-4111 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Yanbo Liang > > Supervised machine learning includes classification and regression. There are > classification metrics (BinaryClassificationMetrics) in MLlib; we also need > regression metrics to evaluate the regression model and tune parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186496#comment-14186496 ] Sean Owen commented on SPARK-4111: -- Is this more than just MAE / RMSE / R2? It might be handy to have a little utility class for these, although they're almost one-liners already. > [MLlib] Implement regression model evaluation metrics > - > > Key: SPARK-4111 > URL: https://issues.apache.org/jira/browse/SPARK-4111 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Yanbo Liang > > Supervised machine learning includes classification and regression. There are > classification metrics (BinaryClassificationMetrics) in MLlib; we also need > regression metrics to evaluate the regression model and tune parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
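The metrics under discussion really are close to one-liners; a plain-Python sketch (illustrative, not MLlib's eventual API):

```python
import math

# Standard regression metrics over paired label/prediction lists.
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(round(mae(y_true, y_pred), 3))  # 0.5
print(round(r2(y_true, y_pred), 3))   # 0.949
```

A distributed version would compute the same sums with an aggregate over an RDD of (prediction, label) pairs, which is essentially what a small utility class would wrap.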
[jira] [Created] (SPARK-4115) add overridden count for EdgeRDD
Lu Lu created SPARK-4115: Summary: add overridden count for EdgeRDD Key: SPARK-4115 URL: https://issues.apache.org/jira/browse/SPARK-4115 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.1.0 Reporter: Lu Lu Priority: Minor Fix For: 1.1.1 Add an overridden count for edge counting in EdgeRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4115) [GraphX] add overridden count for EdgeRDD
[ https://issues.apache.org/jira/browse/SPARK-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Lu updated SPARK-4115: - Summary: [GraphX] add overridden count for EdgeRDD (was: add overridden count for EdgeRDD) > [GraphX] add overridden count for EdgeRDD > > > Key: SPARK-4115 > URL: https://issues.apache.org/jira/browse/SPARK-4115 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Lu Lu >Priority: Minor > Fix For: 1.1.1 > > > Add an overridden count for edge counting in EdgeRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4107) Incorrect handling of Channel.read()'s return value may lead to data truncation
[ https://issues.apache.org/jira/browse/SPARK-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186506#comment-14186506 ] Apache Spark commented on SPARK-4107: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2974 > Incorrect handling of Channel.read()'s return value may lead to data > truncation > --- > > Key: SPARK-4107 > URL: https://issues.apache.org/jira/browse/SPARK-4107 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > When using {{Channel.read()}}, we need to properly handle the return value > and account for the case where we've read fewer bytes than expected. There > are a few places where we don't do this properly, which may lead to incorrect > data truncation in rare circumstances. I've opened a PR to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
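A plain-Python analogue of the pattern the fix requires (the short-read contract is similar across languages; names here are illustrative): a single `read()` may return fewer bytes than requested, so the call must be retried in a loop until the buffer is full, treating early EOF as an error rather than silently truncating.

```python
import io

def read_fully(stream, n):
    """Read exactly n bytes, retrying on short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))  # may return fewer bytes than asked
        if not chunk:                      # EOF before n bytes: truncation
            raise EOFError("expected %d bytes, got %d" % (n, len(buf)))
        buf.extend(chunk)
    return bytes(buf)

class OneByteStream:
    """Simulates a channel that returns at most one byte per read()."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def read(self, n):
        return self._buf.read(min(1, n))

# A naive single read() against this stream would return only b"h";
# the loop recovers the full payload.
assert read_fully(OneByteStream(b"hello world"), 11) == b"hello world"
```

The Java `ReadableByteChannel.read()` contract is the same in spirit: check the return value and loop, rather than assuming the whole buffer was filled.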
[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186514#comment-14186514 ] Yanbo Liang commented on SPARK-4111: We have implemented regression metrics such as explained variance score, MAE, MSE and R2 score to evaluate the regression model. Without evaluation metrics, users cannot judge the quality of a model or tune its parameters for better results. I will submit a PR for this issue. > [MLlib] Implement regression model evaluation metrics > - > > Key: SPARK-4111 > URL: https://issues.apache.org/jira/browse/SPARK-4111 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Yanbo Liang > > Supervised machine learning includes classification and regression. There are > classification metrics (BinaryClassificationMetrics) in MLlib; we also need > regression metrics to evaluate the regression model and tune parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4115) [GraphX] add overridden count for EdgeRDD
[ https://issues.apache.org/jira/browse/SPARK-4115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186518#comment-14186518 ] Apache Spark commented on SPARK-4115: - User 'luluorta' has created a pull request for this issue: https://github.com/apache/spark/pull/2975 > [GraphX] add overridden count for EdgeRDD > > > Key: SPARK-4115 > URL: https://issues.apache.org/jira/browse/SPARK-4115 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Lu Lu >Priority: Minor > Fix For: 1.1.1 > > > Add an overridden count for edge counting in EdgeRDD. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186514#comment-14186514 ] Yanbo Liang edited comment on SPARK-4111 at 10/28/14 7:37 AM: -- We have implemented regression metrics such as explained variance score, MAE, MSE and R2 score to evaluate the regression model. Without evaluation metrics, users cannot judge the quality of a model or tune its parameters for better results. I will submit a PR for this issue. Can I have this assigned to me? was (Author: yanboliang): We have implemented regression metrics such as explained variance score, MAE, MSE and R2 score to evaluate the regression model. Without evaluation metrics, users cannot judge the quality of a model or tune its parameters for better results. I will submit a PR for this issue. > [MLlib] Implement regression model evaluation metrics > - > > Key: SPARK-4111 > URL: https://issues.apache.org/jira/browse/SPARK-4111 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Yanbo Liang > > Supervised machine learning includes classification and regression. There are > classification metrics (BinaryClassificationMetrics) in MLlib; we also need > regression metrics to evaluate the regression model and tune parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186562#comment-14186562 ] Sandy Ryza commented on SPARK-3461: --- SPARK-2926 could help with this as well. > Support external groupByKey using repartitionAndSortWithinPartitions > > > Key: SPARK-3461 > URL: https://issues.apache.org/jira/browse/SPARK-3461 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Davies Liu >Priority: Critical > > Given that we have SPARK-2978, it seems like we could support an external > group by operator pretty easily. We'd just have to wrap the existing iterator > exposed by SPARK-2978 with a lookahead iterator that detects the group > boundaries. Also, we'd have to override the cache() operator to cache the > parent RDD so that if this object is cached it doesn't wind through the > iterator. > I haven't totally followed all the sort-shuffle internals, but just given the > stated semantics of SPARK-2978 it seems like this would be possible. > It would be really nice to externalize this because many beginner users write > jobs in terms of groupByKey. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
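The lookahead-iterator idea from the description can be sketched in plain Python (illustrative only; a real external group-by would stream each group's values lazily rather than buffering them in a list): given the per-partition output of repartitionAndSortWithinPartitions, peek one element ahead to detect key boundaries.

```python
# Group a key-sorted iterator of (key, value) pairs by detecting the
# points where the key changes, without any shuffle-side hash map.
def group_by_boundaries(sorted_pairs):
    it = iter(sorted_pairs)
    try:
        key, value = next(it)
    except StopIteration:
        return  # empty partition
    group = [value]
    for k, v in it:
        if k == key:           # still inside the current group
            group.append(v)
        else:                  # boundary detected: emit and start a new group
            yield key, group
            key, group = k, [v]
    yield key, group           # flush the final group

pairs = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
print(list(group_by_boundaries(pairs)))
# [('a', [1, 2]), ('b', [3]), ('c', [4, 5])]
```

Because the input is already sorted within each partition, one forward pass suffices; this is the property that makes an external (spillable) groupByKey feasible on top of SPARK-2978's sorted iterator.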
[jira] [Created] (SPARK-4116) Delete the abandoned log4j-spark-container.properties
WangTaoTheTonic created SPARK-4116: -- Summary: Delete the abandoned log4j-spark-container.properties Key: SPARK-4116 URL: https://issues.apache.org/jira/browse/SPARK-4116 Project: Spark Issue Type: Improvement Components: YARN Reporter: WangTaoTheTonic Priority: Minor Seems like the properties file was abandoned, we could delete it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4116) Delete the abandoned log4j-spark-container.properties
[ https://issues.apache.org/jira/browse/SPARK-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186641#comment-14186641 ] Apache Spark commented on SPARK-4116: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2977 > Delete the abandoned log4j-spark-container.properties > - > > Key: SPARK-4116 > URL: https://issues.apache.org/jira/browse/SPARK-4116 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > > Seems like the properties file was abandoned, we could delete it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4038) Outlier Detection Algorithm for MLlib
[ https://issues.apache.org/jira/browse/SPARK-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186656#comment-14186656 ] Ashutosh Trivedi commented on SPARK-4038: - Tagging [~Kaushik619], as he is also working with me on this. > Outlier Detection Algorithm for MLlib > - > > Key: SPARK-4038 > URL: https://issues.apache.org/jira/browse/SPARK-4038 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Ashutosh Trivedi >Priority: Minor > > The aim of this JIRA is to discuss which parallel outlier detection > algorithms can be included in MLlib. > The one I am familiar with is Attribute Value Frequency (AVF). It > scales linearly with the number of data points and attributes, and relies on > a single data scan. It is not distance based and is well suited for categorical > data. In the original paper a parallel version is also given, which is not > complicated to implement. I am working on the implementation and will soon submit > the initial code for review. > Here is the link to the paper > http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4410382 > As pointed out by Xiangrui in the discussion > http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-td8880.html > There are other algorithms as well. Let's discuss which will be more > general and more easily parallelized. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186659#comment-14186659 ] Ashutosh Trivedi commented on SPARK-2335: - Thanks [~bgawalt] and [~slcclimber] for helping us out. Looking forward to working with you guys. Tagging [~Kaushik619] here, as he is also working with me. We will be giving inputs here soon. > k-Nearest Neighbor classification and regression for MLLib > -- > > Key: SPARK-2335 > URL: https://issues.apache.org/jira/browse/SPARK-2335 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: features, newbie > > The k-Nearest Neighbor model for classification and regression problems is a > simple and intuitive approach, offering a straightforward path to creating > non-linear decision/estimation contours. Its downsides -- high variance > (sensitivity to the known training data set) and computational intensity for > estimating new point labels -- both play to Spark's big data strengths: lots > of data mitigates data concerns; lots of workers mitigate computational > latency. > We should include kNN models as options in MLLib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
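For reference, the serial core that a distributed implementation would parallelize across partitions can be sketched in a few lines of plain Python (illustrative only, not a proposed MLlib API):

```python
from collections import Counter

# Brute-force kNN classification: score every training point against the
# query, take the k nearest, and vote. A Spark version would compute the
# per-partition top-k candidates and merge them on the driver.
def knn_predict(train, query, k):
    # train: list of (features, label) pairs; query: feature list
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "x"), ([0.1, 0.2], "x"),
         ([5.0, 5.0], "y"), ([5.2, 4.9], "y")]
assert knn_predict(train, [0.2, 0.1], 3) == "x"
```

The O(n) scan per query is exactly the "computational intensity" the description mentions; distributing the scan over workers is what makes it tractable at scale.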
[jira] [Resolved] (SPARK-3961) Python API for mllib.feature
[ https://issues.apache.org/jira/browse/SPARK-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3961. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2819 [https://github.com/apache/spark/pull/2819] > Python API for mllib.feature > > > Key: SPARK-3961 > URL: https://issues.apache.org/jira/browse/SPARK-3961 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.2.0 > > > Add completed API for mllib.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186710#comment-14186710 ] Yanbo Liang commented on SPARK-4111: https://github.com/apache/spark/pull/2978 > [MLlib] Implement regression model evaluation metrics > - > > Key: SPARK-4111 > URL: https://issues.apache.org/jira/browse/SPARK-4111 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Yanbo Liang > > Supervised machine learning includes classification and regression. There are > classification metrics (BinaryClassificationMetrics) in MLlib; we also need > regression metrics to evaluate the regression model and tune parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186711#comment-14186711 ] Apache Spark commented on SPARK-2352: - User 'bgreeven' has created a pull request for this issue: https://github.com/apache/spark/pull/1290 > [MLLIB] Add Artificial Neural Network (ANN) to Spark > > > Key: SPARK-2352 > URL: https://issues.apache.org/jira/browse/SPARK-2352 > Project: Spark > Issue Type: New Feature > Components: MLlib > Environment: MLLIB code >Reporter: Bert Greevenbosch >Assignee: Bert Greevenbosch > > It would be good if the Machine Learning Library contained Artificial Neural > Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4111) [MLlib] Implement regression model evaluation metrics
[ https://issues.apache.org/jira/browse/SPARK-4111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186741#comment-14186741 ] Apache Spark commented on SPARK-4111: - User 'yanbohappy' has created a pull request for this issue: https://github.com/apache/spark/pull/2978 > [MLlib] Implement regression model evaluation metrics > - > > Key: SPARK-4111 > URL: https://issues.apache.org/jira/browse/SPARK-4111 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Yanbo Liang > > Supervised machine learning includes classification and regression. There are > classification metrics (BinaryClassificationMetrics) in MLlib; we also need > regression metrics to evaluate the regression model and tune parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4116) Delete the abandoned log4j-spark-container.properties
[ https://issues.apache.org/jira/browse/SPARK-4116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-4116. -- Resolution: Fixed Fix Version/s: 1.2.0 > Delete the abandoned log4j-spark-container.properties > - > > Key: SPARK-4116 > URL: https://issues.apache.org/jira/browse/SPARK-4116 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > Fix For: 1.2.0 > > > Seems like the properties file was abandoned, we could delete it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4095) [YARN][Minor]extract val isLaunchingDriver in ClientBase
[ https://issues.apache.org/jira/browse/SPARK-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-4095. -- Resolution: Fixed Fix Version/s: 1.2.0 > [YARN][Minor]extract val isLaunchingDriver in ClientBase > > > Key: SPARK-4095 > URL: https://issues.apache.org/jira/browse/SPARK-4095 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > Fix For: 1.2.0 > > > Instead of checking if `args.userClass` is null repeatedly, we extract it to > a global val as in `ApplicationMaster`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3837) Warn when YARN is killing containers for exceeding memory limits
[ https://issues.apache.org/jira/browse/SPARK-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186850#comment-14186850 ] Apache Spark commented on SPARK-3837: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2744 > Warn when YARN is killing containers for exceeding memory limits > > > Key: SPARK-3837 > URL: https://issues.apache.org/jira/browse/SPARK-3837 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: Sandy Ryza > > YARN now lets application masters know when it kills their containers for > exceeding memory limits. Spark should log something when this happens. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4117) Spark on Yarn handle AM being told command from RM
Thomas Graves created SPARK-4117: Summary: Spark on Yarn handle AM being told command from RM Key: SPARK-4117 URL: https://issues.apache.org/jira/browse/SPARK-4117 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves In the allocateResponse from the RM, it can send commands that the AM should follow, for instance AM_RESYNC and AM_SHUTDOWN. We should add support for those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4108) Fix uses of @deprecated in catalyst dataTypes
[ https://issues.apache.org/jira/browse/SPARK-4108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anant Daksh Asthana updated SPARK-4108: --- Component/s: SQL > Fix uses of @deprecated in catalyst dataTypes > - > > Key: SPARK-4108 > URL: https://issues.apache.org/jira/browse/SPARK-4108 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Anant Daksh Asthana >Priority: Trivial > > @deprecated takes 2 parameters: message and version > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala > has a usage of @deprecated with just one parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3611) Show number of cores for each executor in application web UI
[ https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186909#comment-14186909 ] Apache Spark commented on SPARK-3611: - User 'devldevelopment' has created a pull request for this issue: https://github.com/apache/spark/pull/2980 > Show number of cores for each executor in application web UI > > > Key: SPARK-3611 > URL: https://issues.apache.org/jira/browse/SPARK-3611 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This number is not always fully known, because e.g. in Mesos your executors > can scale up and down in # of CPUs, but it would be nice to show at least the > number of cores the machine has in that case, or the # of cores the executor > has been configured with if known. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186910#comment-14186910 ] zzc commented on SPARK-2468: Hi Reynold Xin, when will this issue be solved? I need to improve shuffle performance as soon as possible. > Netty-based block server / client module > > > Key: SPARK-2468 > URL: https://issues.apache.org/jira/browse/SPARK-2468 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > Right now shuffle send goes through the block manager. This is inefficient > because it requires loading a block from disk into a kernel buffer, then into > a user space buffer, and then back to a kernel send buffer before it reaches > the NIC. It does multiple copies of the data and context switching between > kernel/user. It also creates unnecessary buffer in the JVM that increases GC. > Instead, we should use FileChannel.transferTo, which handles this in the > kernel space with zero-copy. See > http://www.ibm.com/developerworks/library/j-zerocopy/ > One potential solution is to use Netty. Spark already has a Netty based > network module implemented (org.apache.spark.network.netty). However, it > lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
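For readers unfamiliar with the zero-copy path the SPARK-2468 description refers to, here is a minimal Java sketch of FileChannel.transferTo (file names are illustrative; this is not Spark's actual code):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ZeroCopyDemo {
    // Copy a file via FileChannel.transferTo: the kernel can move the bytes
    // directly to the target channel, avoiding the kernel->user->kernel
    // copies (and JVM garbage) described in the issue.
    static long transfer(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                                                StandardOpenOption.WRITE)) {
            long pos = 0;
            long size = in.size();
            // transferTo may transfer fewer bytes than requested, so loop.
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
            return pos;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("block", ".bin");
        Path dst = Files.createTempFile("copy", ".bin");
        Files.write(src, "shuffle block bytes".getBytes());
        System.out.println(transfer(src, dst)); // 19
    }
}
```

In a real block server the target would be a SocketChannel rather than a file; the same call applies, since transferTo accepts any WritableByteChannel.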
[jira] [Resolved] (SPARK-4098) use appUIAddress instead of appUIHostPort in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-4098. -- Resolution: Fixed Fix Version/s: 1.2.0 > use appUIAddress instead of appUIHostPort in yarn-client mode > - > > Key: SPARK-4098 > URL: https://issues.apache.org/jira/browse/SPARK-4098 > Project: Spark > Issue Type: Improvement > Components: YARN >Reporter: WangTaoTheTonic >Priority: Minor > Fix For: 1.2.0 > > > I don't understand why appUIHostPort is used here, but in yarn-cluster mode we > use appUIAddress. So I replaced it. > Testing results show it is OK to change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186954#comment-14186954 ] Kaushik Ranjan commented on SPARK-2335: --- Hi [~bgawalt]. For evaluation of KNN-join, one needs to calculate z-scores of data-points within the dataset. Yu-ISHIKAWA has implemented the following: https://gist.github.com/yu-iskw/37ae208c530f7018e048 Would it be justified to put up a NewFeature Issue to address z-scores? > k-Nearest Neighbor classification and regression for MLLib > -- > > Key: SPARK-2335 > URL: https://issues.apache.org/jira/browse/SPARK-2335 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: features, newbie > > The k-Nearest Neighbor model for classification and regression problems is a > simple and intuitive approach, offering a straightforward path to creating > non-linear decision/estimation contours. Its downsides -- high variance > (sensitivity to the known training data set) and computational intensity for > estimating new point labels -- both play to Spark's big data strengths: lots > of data mitigates data concerns; lots of workers mitigate computational > latency. > We should include kNN models as options in MLLib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
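For context on the comment above, the z-score standardization it mentions is simple to state; a minimal sketch in plain Java (class and method names are made up for illustration, not from the linked gist):

```java
class ZScore {
    // z_i = (x_i - mean) / stddev, using the population standard deviation.
    static double[] zscores(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / x.length);
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - mean) / std;
        }
        return z;
    }

    public static void main(String[] args) {
        // For this sample: mean = 5, stddev = 2.
        double[] z = zscores(new double[]{2, 4, 4, 4, 5, 5, 7, 9});
        System.out.println(z[0]); // (2 - 5) / 2 = -1.5
    }
}
```

In a distributed setting the mean and variance would be computed with an RDD aggregate rather than a local loop, but the formula is the same.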
[jira] [Commented] (SPARK-3561) Allow for pluggable execution contexts in Spark
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186987#comment-14186987 ] Oleg Zhurakousky commented on SPARK-3561: - [~vanzin] I would not call it hard (we have done it in the initial POC by simply mixing a custom trait into SC - essentially extending it), however I do agree that a lot of Spark's initialization would still happen due to the implementation of SC itself, thus creating and initializing some of the artifacts that may not be used with a different execution context. Question: why was it done like this and not pushed into some SC.init operation? > Allow for pluggable execution contexts in Spark > --- > > Key: SPARK-3561 > URL: https://issues.apache.org/jira/browse/SPARK-3561 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Oleg Zhurakousky > Labels: features > Fix For: 1.2.0 > > Attachments: SPARK-3561.pdf > > > Currently Spark provides integration with external resource-managers such as > Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the > current architecture of Spark-on-YARN can be enhanced to provide > significantly better utilization of cluster resources for large scale, batch > and/or ETL applications when run alongside other applications (Spark and > others) and services in YARN. > Proposal: > The proposed approach would introduce a pluggable JobExecutionContext (trait) > - a gateway and a delegate to Hadoop execution environment - as a non-public > api (@Experimental) not exposed to end users of Spark. > The trait will define 6 operations: > * hadoopFile > * newAPIHadoopFile > * broadcast > * runJob > * persist > * unpersist > Each method directly maps to the corresponding methods in the current version of > SparkContext. 
The JobExecutionContext implementation will be accessed by > SparkContext via master URL as > "execution-context:foo.bar.MyJobExecutionContext" with default implementation > containing the existing code from SparkContext, thus allowing current > (corresponding) methods of SparkContext to delegate to such implementation. > An integrator will now have an option to provide a custom implementation of > DefaultExecutionContext by either implementing it from scratch or extending > from DefaultExecutionContext. > Please see the attached design doc for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
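To make the shape of the SPARK-3561 proposal concrete, the six-operation gateway could be sketched roughly as below. This is a hypothetical Java rendering for illustration only: the actual proposal is a Scala trait, and the Object-typed signatures here are simplified stand-ins, not the real API.

```java
// Hypothetical, simplified rendering of the proposed gateway. The real
// trait would use RDD/SparkContext types; Object is a stand-in here.
interface JobExecutionContext {
    Object hadoopFile(String path);
    Object newAPIHadoopFile(String path);
    Object broadcast(Object value);
    Object runJob(Object rdd, Object func);
    void persist(Object rdd);
    void unpersist(Object rdd);
}

// The default implementation would keep today's SparkContext behavior;
// integrators could plug in their own via a master URL such as
// "execution-context:foo.bar.MyJobExecutionContext".
class DefaultExecutionContext implements JobExecutionContext {
    public Object hadoopFile(String path) { return path; }        // placeholder
    public Object newAPIHadoopFile(String path) { return path; }  // placeholder
    public Object broadcast(Object value) { return value; }       // placeholder
    public Object runJob(Object rdd, Object func) { return rdd; } // placeholder
    public void persist(Object rdd) { }
    public void unpersist(Object rdd) { }
}

class ExecutionContextDemo {
    public static void main(String[] args) {
        // SparkContext methods would delegate to ctx rather than run inline.
        JobExecutionContext ctx = new DefaultExecutionContext();
        System.out.println(ctx.broadcast("table").equals("table")); // true
    }
}
```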
[jira] [Commented] (SPARK-3657) yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell
[ https://issues.apache.org/jira/browse/SPARK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187070#comment-14187070 ] Apache Spark commented on SPARK-3657: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2981 > yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl > starting spark-shell > --- > > Key: SPARK-3657 > URL: https://issues.apache.org/jira/browse/SPARK-3657 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Priority: Blocker > > YarnRMClientImpl.registerApplicationMaster can throw null pointer exception > when setting the trackingurl if its empty: > appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority()) > I hit this just start spark-shell without the tracking url set. > 14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager > at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030 > Exception in thread "main" java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132) > at > org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102) > at > org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55) > at > org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38) > at > org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168) > at > org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - 
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4031) Read broadcast variables on use
[ https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-4031. -- Resolution: Fixed Fix Version/s: 1.2.0 > Read broadcast variables on use > --- > > Key: SPARK-4031 > URL: https://issues.apache.org/jira/browse/SPARK-4031 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 1.2.0 > > > This is a proposal to change the broadcast variable implementations in Spark > to only read values when they are used rather than on deserializing. > This change will be very helpful (and in our use cases required) for complex > applications which have a large number of broadcast variables. For example if > broadcast variables are class members, they are captured in closures even > when they are not used. > We could also consider cleaning closures more aggressively, but that might be > a more complex change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4031) Read broadcast variables on use
[ https://issues.apache.org/jira/browse/SPARK-4031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187122#comment-14187122 ] Shivaram Venkataraman commented on SPARK-4031: -- Issue resolved by pull request 2871 https://github.com/apache/spark/pull/2871 > Read broadcast variables on use > --- > > Key: SPARK-4031 > URL: https://issues.apache.org/jira/browse/SPARK-4031 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Reporter: Shivaram Venkataraman >Assignee: Shivaram Venkataraman > Fix For: 1.2.0 > > > This is a proposal to change the broadcast variable implementations in Spark > to only read values when they are used rather than on deserializing. > This change will be very helpful (and in our use cases required) for complex > applications which have a large number of broadcast variables. For example if > broadcast variables are class members, they are captured in closures even > when they are not used. > We could also consider cleaning closures more aggressively, but that might be > a more complex change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4118) Create python bindings for Streaming KMeans
Anant Daksh Asthana created SPARK-4118: -- Summary: Create python bindings for Streaming KMeans Key: SPARK-4118 URL: https://issues.apache.org/jira/browse/SPARK-4118 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Anant Daksh Asthana Priority: Minor Create Python bindings for Streaming K-means. This is in reference to https://issues.apache.org/jira/browse/SPARK-3254, which adds Streaming K-means functionality to MLLib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4119) Don't rely on HIVE_DEV_HOME to find .q files
Cheng Lian created SPARK-4119: - Summary: Don't rely on HIVE_DEV_HOME to find .q files Key: SPARK-4119 URL: https://issues.apache.org/jira/browse/SPARK-4119 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.1.1 Reporter: Cheng Lian Priority: Minor After merging in Hive 0.13.1 support, a bunch of .q files and golden answer files got updated. Unfortunately, some .q files were updated in Hive. For example, an ORDER BY clause was added to groupby1_limit.q for a bug fix. With HIVE_DEV_HOME set, developers working on Hive 0.12.0 may end up with false test failures, because .q files are looked up from HIVE_DEV_HOME and outdated .q files are used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187183#comment-14187183 ] Reynold Xin commented on SPARK-2468: Scheduled to go in in 1.2. > Netty-based block server / client module > > > Key: SPARK-2468 > URL: https://issues.apache.org/jira/browse/SPARK-2468 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > > Right now shuffle send goes through the block manager. This is inefficient > because it requires loading a block from disk into a kernel buffer, then into > a user space buffer, and then back to a kernel send buffer before it reaches > the NIC. It does multiple copies of the data and context switching between > kernel/user. It also creates unnecessary buffer in the JVM that increases GC > Instead, we should use FileChannel.transferTo, which handles this in the > kernel space with zero-copy. See > http://www.ibm.com/developerworks/library/j-zerocopy/ > One potential solution is to use Netty. Spark already has a Netty based > network module implemented (org.apache.spark.network.netty). However, it > lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4120) Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL
Ravindra Pesala created SPARK-4120: -- Summary: Join of multiple tables with syntax like SELECT .. FROM T1,T2,T3.. does not work in SparkSQL Key: SPARK-4120 URL: https://issues.apache.org/jira/browse/SPARK-4120 Project: Spark Issue Type: Bug Components: SQL Reporter: Ravindra Pesala Fix For: 1.2.0 Queries joining more than 2 tables do not work. {code} sql("SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key") {code} The above query gives the following exception. {code} Exception in thread "main" java.lang.RuntimeException: [1.40] failure: ``UNION'' expected but `,' found SELECT * FROM records1 as a,records2 as b,records3 as c where a.key=b.key and a.key=c.key ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(SparkSQLParser.scala:33) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:75) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
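For anyone hitting SPARK-4120 before a fix lands, the same join can usually be expressed with explicit JOIN ... ON syntax, which the parser may accept even where the comma-separated FROM list fails (a hedged workaround sketch using the table and column names from the report):

```sql
SELECT *
FROM records1 a
JOIN records2 b ON a.key = b.key
JOIN records3 c ON a.key = c.key
```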
[jira] [Updated] (SPARK-3611) Show number of cores for each executor in application web UI
[ https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3611: - Affects Version/s: 1.0.0 > Show number of cores for each executor in application web UI > > > Key: SPARK-3611 > URL: https://issues.apache.org/jira/browse/SPARK-3611 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This number is not always fully known, because e.g. in Mesos your executors > can scale up and down in # of CPUs, but it would be nice to show at least the > number of cores the machine has in that case, or the # of cores the executor > has been configured with if known. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4110) Wrong comments about default settings in spark-daemon.sh
[ https://issues.apache.org/jira/browse/SPARK-4110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4110. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Kousuke Saruta Target Version/s: 1.1.1, 1.2.0 (was: 1.2.0) > Wrong comments about default settings in spark-daemon.sh > > > Key: SPARK-4110 > URL: https://issues.apache.org/jira/browse/SPARK-4110 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > In spark-daemon.sh, there are the following comments. > {code} > # SPARK_CONF_DIR Alternate conf dir. Default is ${SPARK_PREFIX}/conf. > # SPARK_LOG_DIR Where log files are stored. PWD by default. > {code} > But, I think the default value for SPARK_CONF_DIR is ${SPARK_HOME}/conf and > for SPARK_LOG_DIR is ${SPARK_HOME}/logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4107) Incorrect handling of Channel.read()'s return value may lead to data truncation
[ https://issues.apache.org/jira/browse/SPARK-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-4107. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 > Incorrect handling of Channel.read()'s return value may lead to data > truncation > --- > > Key: SPARK-4107 > URL: https://issues.apache.org/jira/browse/SPARK-4107 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0, 1.1.1, 1.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.1.1, 1.2.0 > > > When using {{Channel.read()}}, we need to properly handle the return value > and account for the case where we've read fewer bytes than expected. There > are a few places where we don't do this properly, which may lead to incorrect > data truncation in rare circumstances. I've opened a PR to fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4096) let ApplicationMaster accept executor memory argument in same format as JVM memory strings
[ https://issues.apache.org/jira/browse/SPARK-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4096: - Affects Version/s: 1.1.0 > let ApplicationMaster accept executor memory argument in same format as JVM > memory strings > -- > > Key: SPARK-4096 > URL: https://issues.apache.org/jira/browse/SPARK-4096 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: WangTaoTheTonic >Priority: Minor > Fix For: 1.2.0 > > > Currently ApplicationMaster accepts the executor memory argument only in number format; > we should let it accept JVM-style memory strings as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4096) let ApplicationMaster accept executor memory argument in same format as JVM memory strings
[ https://issues.apache.org/jira/browse/SPARK-4096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4096. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: WangTaoTheTonic Target Version/s: 1.2.0 > let ApplicationMaster accept executor memory argument in same format as JVM > memory strings > -- > > Key: SPARK-4096 > URL: https://issues.apache.org/jira/browse/SPARK-4096 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.1.0 >Reporter: WangTaoTheTonic >Assignee: WangTaoTheTonic >Priority: Minor > Fix For: 1.2.0 > > > Currently ApplicationMaster accepts the executor memory argument only in number format; > we should let it accept JVM-style memory strings as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
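For reference, parsing a JVM-style memory string ("512m", "2g") into megabytes can be done along the following lines. This is an illustrative sketch only; Spark has its own utility for this, and the class and method names here are made up.

```java
class MemoryString {
    // Parse a JVM-style memory string ("300m", "2g", "1t", or a bare
    // megabyte count like "512") into megabytes.
    static int toMb(String s) {
        String lower = s.toLowerCase().trim();
        char suffix = lower.charAt(lower.length() - 1);
        if (Character.isDigit(suffix)) {
            return Integer.parseInt(lower); // bare number: already in MB
        }
        long value = Long.parseLong(lower.substring(0, lower.length() - 1));
        switch (suffix) {
            case 'k': return (int) (value / 1024);
            case 'm': return (int) value;
            case 'g': return (int) (value * 1024);
            case 't': return (int) (value * 1024 * 1024);
            default:  throw new IllegalArgumentException("bad memory string: " + s);
        }
    }

    public static void main(String[] args) {
        System.out.println(toMb("2g"));   // 2048
        System.out.println(toMb("512m")); // 512
    }
}
```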
[jira] [Closed] (SPARK-3657) yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell
[ https://issues.apache.org/jira/browse/SPARK-3657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-3657. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta > yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl > starting spark-shell > --- > > Key: SPARK-3657 > URL: https://issues.apache.org/jira/browse/SPARK-3657 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Kousuke Saruta >Priority: Blocker > Fix For: 1.2.0 > > > YarnRMClientImpl.registerApplicationMaster can throw null pointer exception > when setting the trackingurl if its empty: > appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority()) > I hit this just start spark-shell without the tracking url set. > 14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager > at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030 > Exception in thread "main" java.lang.NullPointerException > at > org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710) > at > org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132) > at > org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102) > at > org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55) > at > org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38) > at > org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168) > at > org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206) > at > org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4089) The version number of Spark in _config.yaml is wrong.
[ https://issues.apache.org/jira/browse/SPARK-4089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4089. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta > The version number of Spark in _config.yaml is wrong. > - > > Key: SPARK-4089 > URL: https://issues.apache.org/jira/browse/SPARK-4089 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.2.0 > > > The version number of Spark in docs/_config.yaml for master branch should be > 1.2.0 for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4065) pyspark will not use ipython on Windows
[ https://issues.apache.org/jira/browse/SPARK-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4065. Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Target Version/s: 1.1.1, 1.2.0 > pyspark will not use ipython on Windows > --- > > Key: SPARK-4065 > URL: https://issues.apache.org/jira/browse/SPARK-4065 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Michael Griffiths >Assignee: Michael Griffiths >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > pyspark2.cmd will not launch ipython, even if the environment variables are > set. It doesn't check for the existence of ipython environment variables - in > all cases, it will just launch python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4065) pyspark will not use ipython on Windows
[ https://issues.apache.org/jira/browse/SPARK-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4065: - Assignee: Michael Griffiths > pyspark will not use ipython on Windows > --- > > Key: SPARK-4065 > URL: https://issues.apache.org/jira/browse/SPARK-4065 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 >Reporter: Michael Griffiths >Assignee: Michael Griffiths >Priority: Minor > Fix For: 1.1.1, 1.2.0 > > > pyspark2.cmd will not launch ipython, even if the environment variables are > set. It doesn't check for the existence of ipython environment variables - in > all cases, it will just launch python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4048) Enhance and extend hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187355#comment-14187355 ] Apache Spark commented on SPARK-4048: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2982 > Enhance and extend hadoop-provided profile > -- > > Key: SPARK-4048 > URL: https://issues.apache.org/jira/browse/SPARK-4048 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > The hadoop-provided profile is used to not package Hadoop dependencies inside > the Spark assembly. It works, sort of, but it could use some enhancements. A > quick list: > - It doesn't include all things that could be removed from the assembly > - It doesn't work well when you're publishing artifacts based on it > (SPARK-3812 fixes this) > - There are other dependencies that could use similar treatment: Hive, HBase > (for the examples), Flume, Parquet, maybe others I'm missing at the moment. > - Unit tests, more specifically, those that use local-cluster mode, do not > work when the assembly is built with this profile enabled. > - The scripts to launch Spark jobs do not add needed "provided" jars to the > classpath when this profile is enabled, leaving it for people to figure that > out for themselves. > - The examples assembly duplicates a lot of things in the main assembly. > Part of this task is selfish since we build internally with this profile and > we'd like to make it easier for us to merge changes without having to keep > too many patches on top of upstream. But those feel like good improvements to > me, regardless. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4058) Log file name is hard coded even though there is a variable '$LOG_FILE '
[ https://issues.apache.org/jira/browse/SPARK-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4058. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Kousuke Saruta > Log file name is hard coded even though there is a variable '$LOG_FILE ' > > > Key: SPARK-4058 > URL: https://issues.apache.org/jira/browse/SPARK-4058 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 1.2.0 > > > In the script 'python/run-tests', the log file name is represented by a variable > 'LOG_FILE' and used throughout. But there are some hard-coded log > file names in the script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3814) Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL
[ https://issues.apache.org/jira/browse/SPARK-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3814. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2961 [https://github.com/apache/spark/pull/2961] > Support for Bitwise AND(&), OR(|) ,XOR(^), NOT(~) in Spark HQL and SQL > -- > > Key: SPARK-3814 > URL: https://issues.apache.org/jira/browse/SPARK-3814 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yana Kadiyska >Assignee: Ravindra Pesala >Priority: Minor > Fix For: 1.2.0 > > > Error: java.lang.RuntimeException: > Unsupported language features in query: select (case when bit_field & 1=1 > then r_end - r_start else NULL end) from mytable where pkey='0178-2014-07' > LIMIT 2 > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME >mytable > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_FUNCTION > when > = > & > TOK_TABLE_OR_COL > bit_field > 1 > 1 > - > TOK_TABLE_OR_COL > r_end > TOK_TABLE_OR_COL > r_start > TOK_NULL > TOK_WHERE > = > TOK_TABLE_OR_COL > pkey > '0178-2014-07' > TOK_LIMIT > 2 > SQLState: null > ErrorCode: 0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3988) Public API for DateType support
[ https://issues.apache.org/jira/browse/SPARK-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3988. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2901 [https://github.com/apache/spark/pull/2901] > Public API for DateType support > --- > > Key: SPARK-3988 > URL: https://issues.apache.org/jira/browse/SPARK-3988 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.2.0 > > > add Python API and something else. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4121) Master build failures after shading commons-math3
Xiangrui Meng created SPARK-4121: Summary: Master build failures after shading commons-math3 Key: SPARK-4121 URL: https://issues.apache.org/jira/browse/SPARK-4121 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Xiangrui Meng Priority: Blocker The Spark master Maven build kept failing after we replace colt with commons-math3 and shade the later: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ The error message is: {code} KMeansClusterSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark assembly has been built with Hive, including Datanucleus jars on classpath - task size should be small in both training and prediction *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 9, localhost): java.io.InvalidClassException: org.apache.spark.util.random.PoissonSampler; local class incompatible: stream classdesc serialVersionUID = -795011761847245121, local class serialVersionUID = 424924496318419 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} This test passed with local sbt build. So it should be caused by shading. Maybe there are two versions of commons-math3 (hadoop depends on it), or MLlib doesn't use the shaded version at compile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
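For context on the InvalidClassException above: a class's default serialVersionUID is a hash of its structure, including the descriptors (type names) of its fields and method signatures, so package relocation during shading changes the computed UID between the shaded and unshaded copies of the same class. A minimal illustration (not Spark's actual classes) of how implicit and explicit UIDs behave:

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Without an explicit serialVersionUID, the JVM derives one from the class
// structure; any change to referenced type names (e.g. shading rewriting
// org.apache.commons.math3 to a relocated package) yields a different UID,
// which is the failure mode in the stack trace above.
class ImplicitUid implements Serializable {
    int count;
}

// Declaring the UID explicitly pins it, keeping differently built copies
// of the class wire-compatible.
class ExplicitUid implements Serializable {
    private static final long serialVersionUID = 1L;
    int count;
}

public class UidDemo {
    public static void main(String[] args) {
        // JVM-computed value for the implicit case, fixed value for the explicit one.
        System.out.println(ObjectStreamClass.lookup(ImplicitUid.class).getSerialVersionUID());
        System.out.println(ObjectStreamClass.lookup(ExplicitUid.class).getSerialVersionUID()); // prints 1
    }
}
```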
[jira] [Updated] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4121: - Component/s: MLlib Build > Master build failures after shading commons-math3 > - > > Key: SPARK-4121 > URL: https://issues.apache.org/jira/browse/SPARK-4121 > Project: Spark > Issue Type: Bug > Components: Build, MLlib, Spark Core >Affects Versions: 1.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > > The Spark master Maven build kept failing after we replace colt with > commons-math3 and shade the later: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ > The error message is: > {code} > KMeansClusterSuite: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > - task size should be small in both training and prediction *** FAILED *** > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 > (TID 9, localhost): java.io.InvalidClassException: > org.apache.spark.util.random.PoissonSampler; local class incompatible: stream > classdesc serialVersionUID = -795011761847245121, local class > serialVersionUID = 424924496318419 > java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) > > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > This test passed with local sbt build. So it should be caused by shading. > Maybe there are two versions of commons-math3 (hadoop depends on it), or > MLlib doesn't use the shaded version at compile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4121: - Description: The Spark master Maven build kept failing after we replace colt with commons-math3 and shade the latter: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ The error message is: {code} KMeansClusterSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark assembly has been built with Hive, including Datanucleus jars on classpath - task size should be small in both training and prediction *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 9, localhost): java.io.InvalidClassException: org.apache.spark.util.random.PoissonSampler; local class incompatible: stream classdesc serialVersionUID = -795011761847245121, local class serialVersionUID = 424924496318419 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} This test passed in local sbt build. So the issue should be caused by shading. Maybe there are two versions of commons-math3 (hadoop depends on it), or MLlib doesn't use the shaded version at compile. [~srowen] Could you take a look? Thanks! was: The Spark master Maven build kept failing after we replace colt with commons-math3 and shade the later: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ The error message is: {code} KMeansClusterSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark assembly has been built with Hive, including Datanucleus jars on classpath - task size should be small in both training and prediction *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 9, localhost): java.io.InvalidClassException: org.apache.spark.util.random.PoissonSampler; local class incompatible: stream classdesc serialVersionUID = -795011761847245121, local class serialVersionUID = 424924496318419 java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) java.io.ObjectInputStream.readObject(ObjectInputStre
[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187453#comment-14187453 ] Patrick Wendell commented on SPARK-4121: [~srowen] - can you help with this? This is likely happening because the PoissonSampler on the driver is using the classpath from Maven (with the unmodified version of PoissonSampler) and the executors are using the version from the assembly jar, which has package relocations of the commons math dependency in the byte code. This is a test that uses "local-cluster" mode. Is there a reason we are doing these relocations in the assembly only? Would it be better to actually shade-and-inline commons-math in both the spark-core and spark-mllib package jars? Having discrepancies between the assembly and package jars I'm guessing could lead to problems other than just this test issue. It also means that applications which compile against Spark's dependencies rather than running through the Spark assembly packages won't get the benefit of the shading we've done. 
> Master build failures after shading commons-math3 > - > > Key: SPARK-4121 > URL: https://issues.apache.org/jira/browse/SPARK-4121 > Project: Spark > Issue Type: Bug > Components: Build, MLlib, Spark Core >Affects Versions: 1.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > > The Spark master Maven build kept failing after we replace colt with > commons-math3 and shade the latter: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ > The error message is: > {code} > KMeansClusterSuite: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > - task size should be small in both training and prediction *** FAILED *** > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 > (TID 9, localhost): java.io.InvalidClassException: > org.apache.spark.util.random.PoissonSampler; local class incompatible: stream > classdesc serialVersionUID = -795011761847245121, local class > serialVersionUID = 424924496318419 > java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) > > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > This test passed in local sbt build. So the issue should be caused by > shading. Maybe there are two versions of commons-math3 (hadoop depends on > it), or MLlib doesn't use the shaded version at compile. > [~srowen] Could you take a look? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187458#comment-14187458 ] Josh Rosen commented on SPARK-4121: --- Here's an easy command to reproduce this: {code} mvn -DskipTests package mvn test -DwildcardSuites=org.apache.spark.mllib.clustering.KMeansClusterSuite {code} > Master build failures after shading commons-math3 > - > > Key: SPARK-4121 > URL: https://issues.apache.org/jira/browse/SPARK-4121 > Project: Spark > Issue Type: Bug > Components: Build, MLlib, Spark Core >Affects Versions: 1.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > > The Spark master Maven build kept failing after we replace colt with > commons-math3 and shade the latter: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ > The error message is: > {code} > KMeansClusterSuite: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > - task size should be small in both training and prediction *** FAILED *** > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 > (TID 9, localhost): java.io.InvalidClassException: > org.apache.spark.util.random.PoissonSampler; local class incompatible: stream > classdesc serialVersionUID = -795011761847245121, local class > serialVersionUID = 424924496318419 > java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) > > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > This test passed in local sbt build. So the issue should be caused by > shading. Maybe there are two versions of commons-math3 (hadoop depends on > it), or MLlib doesn't use the shaded version at compile. > [~srowen] Could you take a look? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4122) Add library to write data back to Kafka
Hari Shreedharan created SPARK-4122: --- Summary: Add library to write data back to Kafka Key: SPARK-4122 URL: https://issues.apache.org/jira/browse/SPARK-4122 Project: Spark Issue Type: Bug Reporter: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187471#comment-14187471 ] Michael Griffiths commented on SPARK-3398: -- I'm running into an issue with {{wait_for_cluster_state}} - specifically, waiting {{for ssh-ready}}. AFAICT the [valid states in boto are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]: * pending * running * shutting-down * terminated * stopping * stopped When I invoke spark_ec2.py, it never moves to the next stage (infinite loop). Is {{ssh-ready}} a state in a different version of boto? Thanks, Michael > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187471#comment-14187471 ] Michael Griffiths edited comment on SPARK-3398 at 10/28/14 9:10 PM: I'm running into an issue with {{wait_for_cluster_state}} - specifically, waiting for {{ssh-ready}}. AFAICT the [valid states in boto are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]: * pending * running * shutting-down * terminated * stopping * stopped When I invoke spark_ec2.py, it never moves to the next stage (infinite loop). Is {{ssh-ready}} a state in a different version of boto? Thanks, Michael was (Author: michael.griffiths): I'm running into an issue with {{wait_for_cluster_state}} - specifically, waiting {{for ssh-ready}}. AFAICT the [valid states in boto are|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.instance.InstanceState]: * pending * running * shutting-down * terminated * stopping * stopped When I invoke spark_ec2.py, it never moves to the next stage (infinite loop). Is {{ssh-ready}} a state in a different version of boto? Thanks, Michael > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. 
> Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4121) Master build failures after shading commons-math3
[ https://issues.apache.org/jira/browse/SPARK-4121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187475#comment-14187475 ] Sean Owen commented on SPARK-4121: -- Yeah I was seeing this locally, but not on the Jenkins test build, so chalked it up to weirdness in my build. I think the answer may indeed be to do the relocating in core/mllib itself. I'll get on that. > Master build failures after shading commons-math3 > - > > Key: SPARK-4121 > URL: https://issues.apache.org/jira/browse/SPARK-4121 > Project: Spark > Issue Type: Bug > Components: Build, MLlib, Spark Core >Affects Versions: 1.2.0 >Reporter: Xiangrui Meng >Priority: Blocker > > The Spark master Maven build kept failing after we replace colt with > commons-math3 and shade the latter: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/ > The error message is: > {code} > KMeansClusterSuite: > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > Spark assembly has been built with Hive, including Datanucleus jars on > classpath > - task size should be small in both training and prediction *** FAILED *** > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 > (TID 9, localhost): java.io.InvalidClassException: > org.apache.spark.util.random.PoissonSampler; local class incompatible: stream > classdesc serialVersionUID = -795011761847245121, local class > serialVersionUID = 424924496318419 > java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) > > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} > This test passed in local sbt build. So the issue should be caused by > shading. Maybe there are two versions of commons-math3 (hadoop depends on > it), or MLlib doesn't use the shaded version at compile. > [~srowen] Could you take a look? Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3922) A global UTF8 constant for Spark
[ https://issues.apache.org/jira/browse/SPARK-3922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3922. Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Shixiong Zhu > A global UTF8 constant for Spark > > > Key: SPARK-3922 > URL: https://issues.apache.org/jira/browse/SPARK-3922 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 1.2.0 > > > A global UTF8 constant is very helpful to handle encoding problems when > converting between String and bytes. There are several solutions here: > 1. Add `val UTF_8 = Charset.forName("UTF-8")` to Utils.scala > 2. java.nio.charset.StandardCharsets.UTF_8 (require JDK7) > 3. io.netty.util.CharsetUtil.UTF_8 > 4. com.google.common.base.Charsets.UTF_8 > 5. org.apache.commons.lang.CharEncoding.UTF_8 > 6. org.apache.commons.lang3.CharEncoding.UTF_8 > IMO, I prefer option 1) because people can find it easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
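The options listed in the ticket all resolve to the same charset; a quick sketch of option 1 (a project-wide constant) alongside the JDK 7 constant from option 2:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    // Option 1 from the ticket: a single constant in a utility class.
    static final Charset UTF_8 = Charset.forName("UTF-8");

    public static void main(String[] args) {
        // Option 2 (java.nio.charset.StandardCharsets, JDK 7+) denotes the same charset.
        System.out.println(UTF_8.equals(StandardCharsets.UTF_8)); // prints true
        // Typical use: an encoding-safe String/byte[] round trip with no
        // reliance on the platform default charset.
        byte[] bytes = "résumé".getBytes(UTF_8);
        System.out.println(new String(bytes, UTF_8)); // prints résumé
    }
}
```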
[jira] [Resolved] (SPARK-3343) Support for CREATE TABLE AS SELECT that specifies the format
[ https://issues.apache.org/jira/browse/SPARK-3343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3343. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2570 [https://github.com/apache/spark/pull/2570] > Support for CREATE TABLE AS SELECT that specifies the format > > > Key: SPARK-3343 > URL: https://issues.apache.org/jira/browse/SPARK-3343 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: HuQizhong > Fix For: 1.2.0 > > > hql("""CREATE TABLE tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS > TERMINATED BY ',' LINES TERMINATED BY '\n' as SELECT SUM(uv) as uv, > round(SUM(cost),2) as total, round(SUM(cost)/SUM(uv),2) FROM > tmp_adclick_sellplat """) > 14/09/02 15:32:28 INFO ParseDriver: Parse Completed > java.lang.RuntimeException: > Unsupported language features in query: CREATE TABLE > tmp_adclick_gm_all ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES > TERMINATED BY 'abc' as SELECT SUM(uv) as uv, round(SUM(cost),2) as > total, round(SUM(cost)/SUM(uv),2) FROM tmp_adclick_sellplat > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:255) > at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:75) > at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:78) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2335) k-Nearest Neighbor classification and regression for MLLib
[ https://issues.apache.org/jira/browse/SPARK-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Gawalt updated SPARK-2335: Labels: features (was: features newbie) > k-Nearest Neighbor classification and regression for MLLib > -- > > Key: SPARK-2335 > URL: https://issues.apache.org/jira/browse/SPARK-2335 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Brian Gawalt >Priority: Minor > Labels: features > > The k-Nearest Neighbor model for classification and regression problems is a > simple and intuitive approach, offering a straightforward path to creating > non-linear decision/estimation contours. Its downsides -- high variance > (sensitivity to the known training data set) and computational intensity for > estimating new point labels -- both play to Spark's big data strengths: lots > of data mitigates data concerns; lots of workers mitigate computational > latency. > We should include kNN models as options in MLLib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
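The classification variant proposed in the ticket can be sketched in a few lines; this is a hypothetical single-machine, 1-D illustration (not an MLlib API) of the majority-vote rule:

```java
import java.util.Arrays;
import java.util.Comparator;

public class KnnDemo {
    // Minimal 1-D kNN classifier: label a query point by majority vote
    // among the labels of its k nearest training points (labels are 0/1).
    static int classify(double[] xs, int[] labels, double query, int k) {
        Integer[] idx = new Integer[xs.length];
        for (int i = 0; i < xs.length; i++) idx[i] = i;
        // Sort training indices by distance to the query point.
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(xs[i] - query)));
        int votes = 0;
        for (int i = 0; i < k; i++) votes += labels[idx[i]];
        return votes * 2 >= k ? 1 : 0; // majority vote; ties go to 1
    }

    public static void main(String[] args) {
        double[] xs = {0.0, 0.5, 1.0, 9.0, 9.5, 10.0};
        int[] labels = {0, 0, 0, 1, 1, 1};
        System.out.println(classify(xs, labels, 0.2, 3)); // near the 0-cluster: prints 0
        System.out.println(classify(xs, labels, 9.7, 3)); // near the 1-cluster: prints 1
    }
}
```

The brute-force nearest-neighbor scan here is exactly the per-query cost the ticket calls computationally intensive, and what a cluster of workers would parallelize.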
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187519#comment-14187519 ] Nicholas Chammas commented on SPARK-3398: - [~michael.griffiths] - [{{wait_for_cluster_state}}|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L634] will take any of the valid boto states, plus {{ssh-ready}}. {{ssh-ready}} is not a boto state, but rather a handy label for a relevant state that we want to wait for. {{spark-ec2}} manually checks for this state by testing SSH availability on each of the nodes in the cluster. How are you invoking {{spark-ec2}}? Sometimes instances can take a few minutes before SSH becomes available. How long have you waited? > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
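[Editor's note: the {{ssh-ready}} check discussed in this thread amounts to polling every node until a trivial command succeeds over SSH. A minimal standalone sketch of that loop follows; the helper names and the injectable `probe` parameter are invented for illustration and are not the actual {{spark_ec2.py}} code.]

```python
import subprocess
import time

def is_ssh_available(host, user="ec2-user", timeout=3):
    """Return True if a trivial command can be run on `host` over SSH."""
    ret = subprocess.call(
        ["ssh", "-o", "StrictHostKeyChecking=no",
         "-o", "ConnectTimeout=%d" % timeout,
         "%s@%s" % (user, host), "true"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return ret == 0

def wait_for_ssh(hosts, probe=is_ssh_available, interval=10, sleep=time.sleep):
    """Block until every host in the cluster accepts SSH connections."""
    pending = list(hosts)
    while pending:
        # Re-probe only the hosts that have not come up yet.
        pending = [h for h in pending if not probe(h)]
        if pending:
            sleep(interval)

# Example (hypothetical hostnames):
# wait_for_ssh(["ec2-203-0-113-1.compute-1.amazonaws.com",
#               "ec2-203-0-113-2.compute-1.amazonaws.com"])
```

In practice such a loop also needs an overall timeout so a node that never comes up does not block the script forever.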
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187537#comment-14187537 ] Michael Armbrust commented on SPARK-3683: - [~davies] there used to be some explicit code that checked for "NULL", but I can't find it anymore, so you are right this problem might exist in scala too. However, I can't reproduce it as most serdes seem to store null as "\N". Some sample code to reproduce the issue would be helpful. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187543#comment-14187543 ] Michael Griffiths commented on SPARK-3398: -- I waited until all the servers (11) were up according to AWS Console, then ran the command again with --resume. After that, I waited 10 minutes. Then I went in, changed the check to "running", and it worked fine. I'll check my setup (invoking on an Ubuntu server). It's certainly possible there's something wrong there. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187553#comment-14187553 ] Nicholas Chammas commented on SPARK-3398: - Hmm, I'm curious: # Why did you have to run {{spark-ec2}} again with {{--resume}}? # Are you using an AMI other than the standard one? # If yes, do you know what shell that AMI defaults to? What does {{true ; echo $?}} return on that shell? > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187563#comment-14187563 ] Davies Liu commented on SPARK-3683: --- This commit removed the special case for "NULL": {code} commit cf989601d0e784e1c3507720e64636891fe28292 Author: Cheng Lian Date: Fri May 30 22:13:11 2014 -0700 [SPARK-1959] String "NULL" shouldn't be interpreted as null value JIRA issue: [SPARK-1959](https://issues.apache.org/jira/browse/SPARK-1959) Author: Cheng Lian Closes #909 from liancheng/spark-1959 and squashes the following commits: 306659c [Cheng Lian] [SPARK-1959] String "NULL" shouldn't be interpreted as null value diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala index f141139..d263c31 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveOperators.scala @@ -113,7 +113,6 @@ case class HiveTableScan( } private def unwrapHiveData(value: Any) = value match { -case maybeNull: String if maybeNull.toLowerCase == "null" => null case varchar: HiveVarchar => varchar.getValue case decimal: HiveDecimal => BigDecimal(decimal.bigDecimalValue) case other => other {code} So this should be a bug in Hive. > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
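[Editor's note: the distinction at the heart of this thread is that Hive's default text SerDe writes a SQL NULL as the escape sequence {{\N}}, so the four characters {{NULL}} in a data file are just an ordinary string. A toy parser illustrating that convention follows; it is an illustration of the file format, not Spark or Hive code.]

```python
def parse_text_row(line, delimiter=","):
    """Split a delimited text row the way Hive's default text SerDe
    does: the token \\N denotes SQL NULL, while the literal word NULL
    is an ordinary four-character string."""
    return [None if tok == r"\N" else tok for tok in line.split(delimiter)]

print(parse_text_row(r"alice,\N,NULL"))  # -> ['alice', None, 'NULL']
```

Under this convention, rows that come back from a query as the string {{'NULL'}} really did contain the word NULL in the underlying file, which is why the special case quoted above was removed.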
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187566#comment-14187566 ] Michael Griffiths commented on SPARK-3398: -- In order - # I tried a few times; it kept failing. Ultimately I ran it once to set up the instances, and then waited to ensure I could SSH into them manually before running again. # No, I'm using the default AMI. The only parameters I'm passing are the SSH keyname, the key file, and cluster name. # {{true ; echo $?}} returns 0. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3683) PySpark Hive query generates "NULL" instead of None
[ https://issues.apache.org/jira/browse/SPARK-3683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187570#comment-14187570 ] Michael Armbrust commented on SPARK-3683: - Good find! [~jamborta] if you can give code that shows Hive does interpret this as null I'd consider adding it back for compatibility otherwise it seems like this is expected behavior. /cc [~liancheng] > PySpark Hive query generates "NULL" instead of None > --- > > Key: SPARK-3683 > URL: https://issues.apache.org/jira/browse/SPARK-3683 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.1.0 >Reporter: Tamas Jambor >Assignee: Davies Liu > > When I run a Hive query in Spark SQL, I get the new Row object, where it does > not convert Hive NULL into Python None instead it keeps it string 'NULL'. > It's only an issue with String type, works with other types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4123) Show new dependencies added in pull requests
Patrick Wendell created SPARK-4123: -- Summary: Show new dependencies added in pull requests Key: SPARK-4123 URL: https://issues.apache.org/jira/browse/SPARK-4123 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Priority: Critical We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message. {code} $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath git checkout apache/master $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath $ diff my-classpath master-classpath < chill-java-0.3.6.jar < chill_2.10-0.3.6.jar --- > chill-java-0.5.0.jar > chill_2.10-0.5.0.jar {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4123) Show new dependencies added in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4123: --- Description: We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message. {code} $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath $ git checkout apache/master $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath $ diff my-classpath master-classpath < chill-java-0.3.6.jar < chill_2.10-0.3.6.jar --- > chill-java-0.5.0.jar > chill_2.10-0.5.0.jar {code} was: We should inspect the classpath of Spark's assembly jar for every pull request. This only takes a few seconds in Maven and it will help weed out dependency changes from the master branch. Ideally we'd post any dependency changes in the pull request message. {code} $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath git checkout apache/master $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath $ diff my-classpath master-classpath < chill-java-0.3.6.jar < chill_2.10-0.3.6.jar --- > chill-java-0.5.0.jar > chill_2.10-0.5.0.jar {code} > Show new dependencies added in pull requests > > > Key: SPARK-4123 > URL: https://issues.apache.org/jira/browse/SPARK-4123 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Priority: Critical > > We should inspect the classpath of Spark's assembly jar for every pull > request. 
This only takes a few seconds in Maven and it will help weed out > dependency changes from the master branch. Ideally we'd post any dependency > changes in the pull request message. > {code} > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath > $ git checkout apache/master > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath > $ diff my-classpath master-classpath > < chill-java-0.3.6.jar > < chill_2.10-0.3.6.jar > --- > > chill-java-0.5.0.jar > > chill_2.10-0.5.0.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
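[Editor's note: the final {{diff}} step in the recipe above boils down to a set comparison of jar file names. The same report can be produced programmatically once the two sorted classpath listings exist; the sketch below operates on precomputed lists, and the jar names besides the chill ones are made up.]

```python
def classpath_diff(before, after):
    """Return (only_in_before, only_in_after) jar names between two
    classpath listings, mirroring what
    `diff my-classpath master-classpath` prints as < and > lines."""
    before_set, after_set = set(before), set(after)
    return sorted(before_set - after_set), sorted(after_set - before_set)

pr_cp = ["chill-java-0.3.6.jar", "chill_2.10-0.3.6.jar", "guava-14.0.1.jar"]
master_cp = ["chill-java-0.5.0.jar", "chill_2.10-0.5.0.jar", "guava-14.0.1.jar"]
only_pr, only_master = classpath_diff(pr_cp, master_cp)
print("only in PR branch:", only_pr)
print("only in master:", only_master)
```

A pull-request builder could run this comparison and post the two lists as a review comment whenever they are non-empty.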
[jira] [Resolved] (SPARK-4084) Reuse sort key in Sorter
[ https://issues.apache.org/jira/browse/SPARK-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4084. --- Resolution: Fixed Fix Version/s: 1.2.0 > Reuse sort key in Sorter > > > Key: SPARK-4084 > URL: https://issues.apache.org/jira/browse/SPARK-4084 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.2.0 > > > Sorter uses generic-typed key for sorting. When data is large, it creates > lots of key objects, which is not efficient. We should reuse the key in > Sorter for memory efficiency. This change is part of the petabyte sort > implementation from [~rxin]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
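[Editor's note: the pattern described in this issue is that instead of allocating a fresh key object for every element compared, the sorter owns a couple of long-lived mutable keys and asks the extractor to fill them in place. The sketch below is a schematic illustration of that API shape in Python, not Spark's actual {{Sorter}}/{{SortDataFormat}} code; all names are invented.]

```python
class MutableKey:
    """A reusable key holder: the sorter owns two of these and refills
    them for each comparison instead of allocating per element."""
    __slots__ = ("value",)
    def __init__(self):
        self.value = None

def get_key(data, pos, reuse):
    # Fill the caller-provided key in place and return it; no new key
    # object is created per call.
    reuse.value = data[pos][0]  # sort records by their first field
    return reuse

def insertion_sort(data):
    """Sort records in place using only two long-lived key objects."""
    k1, k2 = MutableKey(), MutableKey()
    for i in range(1, len(data)):
        j = i
        while j > 0 and get_key(data, j, k1).value < get_key(data, j - 1, k2).value:
            data[j - 1], data[j] = data[j], data[j - 1]
            j -= 1
    return data

print(insertion_sort([(3, "c"), (1, "a"), (2, "b")]))
```

In a JVM sorter over large data the same trick avoids creating one key object per comparison, which is where the memory-efficiency win comes from.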
[jira] [Commented] (SPARK-4123) Show new dependencies added in pull requests
[ https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187590#comment-14187590 ] Patrick Wendell commented on SPARK-4123: [~nchammas] - do you have any interest in doing this one? > Show new dependencies added in pull requests > > > Key: SPARK-4123 > URL: https://issues.apache.org/jira/browse/SPARK-4123 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Patrick Wendell >Priority: Critical > > We should inspect the classpath of Spark's assembly jar for every pull > request. This only takes a few seconds in Maven and it will help weed out > dependency changes from the master branch. Ideally we'd post any dependency > changes in the pull request message. > {code} > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath > $ git checkout apache/master > $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly | grep -v > INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath > $ diff my-classpath master-classpath > < chill-java-0.3.6.jar > < chill_2.10-0.3.6.jar > --- > > chill-java-0.5.0.jar > > chill_2.10-0.5.0.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3398) Have spark-ec2 intelligently wait for specific cluster states
[ https://issues.apache.org/jira/browse/SPARK-3398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187593#comment-14187593 ] Nicholas Chammas commented on SPARK-3398: - OK, so you're invoking {{spark-ec2}} from an Ubuntu server. I wonder if that matters any, specifically when we make [this call|https://github.com/apache/spark/blob/4b55482abf899c27da3d55401ad26b4e9247b327/ec2/spark_ec2.py#L615]. What happens if you replace the code at that line with this version? {code} ret = subprocess.check_call( ssh_command(opts) + ['-t', '-t', '-o', 'ConnectTimeout=3', '%s@%s' % (opts.user, host), stringify_command('true')] ) {code} This will just print SSH's output to the screen instead of suppressing it. If anything's going wrong, it should be more obvious that way. > Have spark-ec2 intelligently wait for specific cluster states > - > > Key: SPARK-3398 > URL: https://issues.apache.org/jira/browse/SPARK-3398 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 1.2.0 > > > {{spark-ec2}} currently has retry logic for when it tries to install stuff on > a cluster and for when it tries to destroy security groups. > It would be better to have some logic that allows {{spark-ec2}} to explicitly > wait for when all the nodes in a cluster it is working on have reached a > specific state. > Examples: > * Wait for all nodes to be up > * Wait for all nodes to be up and accepting SSH connections (then start > installing stuff) > * Wait for all nodes to be down > * Wait for all nodes to be terminated (then delete the security groups) > Having a function in the {{spark_ec2.py}} script that blocks until the > desired cluster state is reached would reduce the need for various retry > logic. It would probably also eliminate the need for the {{--wait}} parameter. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org