[jira] [Resolved] (SPARK-2963) The description about how to build for using CLI and Thrift JDBC server is absent in proper document

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2963.


Resolution: Fixed
  Assignee: Kousuke Saruta

Thanks - I've merged your fix.

> The description about how to build for using CLI and Thrift JDBC server is 
> absent in proper document 
> -
>
> Key: SPARK-2963
> URL: https://issues.apache.org/jira/browse/SPARK-2963
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.1.0
>
>
> Currently, if we'd like to use HiveServer or the CLI for Spark SQL, we need to 
> use the -Phive-thriftserver option when building, but its description is 
> incomplete.






[jira] [Updated] (SPARK-3170) Bug Fix in Storage UI

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3170:
---

Priority: Critical  (was: Minor)

> Bug Fix in Storage UI
> -
>
> Key: SPARK-3170
> URL: https://issues.apache.org/jira/browse/SPARK-3170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
>Reporter: uncleGen
>Priority: Critical
>
> A completed stage only needs to remove its own partitions that are no 
> longer cached. Currently, "Storage" in the Spark UI may lose some RDDs that 
> are actually cached.






[jira] [Commented] (SPARK-3170) Bug Fix in Storage UI

2014-08-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107868#comment-14107868
 ] 

Patrick Wendell commented on SPARK-3170:


I think this is actually fairly important to fix if possible.

> Bug Fix in Storage UI
> -
>
> Key: SPARK-3170
> URL: https://issues.apache.org/jira/browse/SPARK-3170
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
>Reporter: uncleGen
>Priority: Critical
>
> A completed stage only needs to remove its own partitions that are no 
> longer cached. Currently, "Storage" in the Spark UI may lose some RDDs that 
> are actually cached.






[jira] [Resolved] (SPARK-3175) Branch-1.1 SBT build failed for Yarn-Alpha

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3175.


Resolution: Won't Fix

We have to keep these versions slightly out of sync when we are making release 
candidates due to the way that our Maven publishing plug-in works. If you 
check out the specific release snapshots, though (e.g. snapshot1, rc1, etc.), 
then it will work. This issue is only relevant for the older YARN build.

> Branch-1.1 SBT build failed for Yarn-Alpha
> --
>
> Key: SPARK-3175
> URL: https://issues.apache.org/jira/browse/SPARK-3175
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.1.1
>Reporter: Chester
>  Labels: build
> Fix For: 1.1.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When trying to build yarn-alpha on branch-1.1
> |branch-1.1|$ sbt/sbt -Pyarn-alpha -Dhadoop.version=2.0.5-alpha projects
> [info] Loading project definition from /Users/chester/projects/spark/project
> org.apache.maven.model.building.ModelBuildingException: 1 problem was 
> encountered while building the effective model for 
> org.apache.spark:spark-yarn-alpha_2.10:1.1.0
> [FATAL] Non-resolvable parent POM: Could not find artifact 
> org.apache.spark:yarn-parent_2.10:pom:1.1.0 in central ( 
> http://repo.maven.apache.org/maven2) and 'parent.relativePath' points at 
> wrong local POM @ line 20, column 11






[jira] [Updated] (SPARK-3169) make-distribution.sh failed

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3169:
---

Assignee: Tathagata Das

> make-distribution.sh failed
> ---
>
> Key: SPARK-3169
> URL: https://issues.apache.org/jira/browse/SPARK-3169
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Guoqiang Li
>Assignee: Tathagata Das
>Priority: Blocker
> Fix For: 1.1.0
>
>
> {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
> -Dhadoop.version=2.3.0 
> {code}
>  =>
> {noformat}
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
> signature in TestSuiteBase.class refers to term dstream
> in package org.apache.spark.streaming which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> TestSuiteBase.class.
> {noformat}






[jira] [Resolved] (SPARK-3169) make-distribution.sh failed

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3169.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 2101
[https://github.com/apache/spark/pull/2101]

> make-distribution.sh failed
> ---
>
> Key: SPARK-3169
> URL: https://issues.apache.org/jira/browse/SPARK-3169
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Guoqiang Li
>Priority: Blocker
> Fix For: 1.1.0
>
>
> {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
> -Dhadoop.version=2.3.0 
> {code}
>  =>
> {noformat}
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
> signature in TestSuiteBase.class refers to term dstream
> in package org.apache.spark.streaming which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> TestSuiteBase.class.
> {noformat}






[jira] [Created] (SPARK-3191) Add explanation of building Spark with Maven in an HTTP proxy environment

2014-08-22 Thread zhengbing li (JIRA)
zhengbing li created SPARK-3191:
---

 Summary: Add explanation of building Spark with Maven in an HTTP 
proxy environment
 Key: SPARK-3191
 URL: https://issues.apache.org/jira/browse/SPARK-3191
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.0.2
 Environment: linux suse 11
maven version: apache-maven-3.0.5
spark version: 1.0.1
proxy setting of maven is:
  <proxy>
    <id>lzb</id>
    <active>true</active>
    <protocol>http</protocol>
    <username>user</username>
    <password>password</password>
    <host>proxy.company.com</host>
    <port>8080</port>
    <nonProxyHosts>*.company.com</nonProxyHosts>
  </proxy>


Reporter: zhengbing li
Priority: Trivial
 Fix For: 1.1.0


When I use "mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean 
package" in http proxy enviroment, I cannot finish this task.Error is as 
follows:
[INFO] Spark Project YARN Stable API . SUCCESS [34.217s]
[INFO] Spark Project Assembly  FAILURE [43.133s]
[INFO] Spark Project External Twitter  SKIPPED
[INFO] Spark Project External Kafka .. SKIPPED
[INFO] Spark Project External Flume .. SKIPPED
[INFO] Spark Project External ZeroMQ . SKIPPED
[INFO] Spark Project External MQTT ... SKIPPED
[INFO] Spark Project Examples  SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 27:57.309s
[INFO] Finished at: Sat Aug 23 09:43:21 CST 2014
[INFO] Final Memory: 51M/1080M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project 
spark-assembly_2.10: Execution default of goal 
org.apache.maven.plugins:maven-shade-plugin:2.2:shade failed: Plugin 
org.apache.maven.plugins:maven-shade-plugin:2.2 or one of its dependencies 
could not be resolved: Could not find artifact 
com.google.code.findbugs:jsr305:jar:1.3.9 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-assembly_2.10


If you use this command instead, it works:
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 
-Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true 
-DskipTests clean package

The error message is not very obvious; I spent a long time solving this issue.
To make things easier for others who use Spark behind an HTTP proxy, I highly 
recommend adding this to the documentation.






[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-08-22 Thread Richard W. Eggert II (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107827#comment-14107827
 ] 

Richard W. Eggert II commented on SPARK-2707:
-

Upgrading to Akka 2.3 will also allow SparkContexts to be created within other 
applications that use Akka 2.3, especially Play 2.3 web applications. Akka 2.2 
and 2.3 appear to be binary incompatible, which means that Spark cannot 
currently be used within a Play 2.3 application.

> Upgrade to Akka 2.3
> ---
>
> Key: SPARK-2707
> URL: https://issues.apache.org/jira/browse/SPARK-2707
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Yardena
>
> Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
> features directly in the same project.






[jira] [Updated] (SPARK-3190) Creation of large graph(> 2.15 B nodes) seems to be broken:possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Summary: Creation of large graph(> 2.15 B nodes) seems to be 
broken:possible overflow somewhere   (was: Creation of large graph(over 2.5B 
nodes) seems to be broken:possible overflow somewhere)

> Creation of large graph(> 2.15 B nodes) seems to be broken:possible overflow 
> somewhere 
> ---
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2 . Using latest code from 
> master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
>Reporter: npanj
>Priority: Critical
>
> While creating a graph with 6B nodes and 12B edges, I noticed that the 
> 'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
> number. A few times (with different datasets > 2.5B nodes) I have also 
> noticed that numVertices is returned as a negative number, so I suspect that 
> there is some overflow (maybe we are using an Int for some field?).
> Here are some details of the experiments I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>Graph returns: numVertices=1807028297 ;  numEdges=12163784626
> 2. Input : numNodes=2157586441 ; noEdges=2747322705
>Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>Graph: numVertices=1725060105 ;  numEdges=2041768213
> You can find the code to generate this bug here: 
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> Note: Nodes are labeled 1...6B.
>  
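A quick check consistent with the Int-overflow suspicion above (editor's
illustration in a Scala REPL; the inputs are the reported numNodes values):

{code}
scala> 6101995593L.toInt          // experiment 1: wraps past 2^32
res0: Int = 1807028297            // matches the incorrect numVertices

scala> 2157586441L.toInt          // experiment 2: wraps past 2^31
res1: Int = -2137380855           // matches the negative numVertices
{code}

Both reported values are exactly what a 64-bit count looks like after being
truncated to a 32-bit Int, which supports the overflow hypothesis.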






[jira] [Commented] (SPARK-3169) make-distribution.sh failed

2014-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107805#comment-14107805
 ] 

Apache Spark commented on SPARK-3169:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/2101

> make-distribution.sh failed
> ---
>
> Key: SPARK-3169
> URL: https://issues.apache.org/jira/browse/SPARK-3169
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Guoqiang Li
>Priority: Blocker
>
> {code}./make-distribution.sh -Pyarn -Phadoop-2.3 -Phive-thriftserver -Phive 
> -Dhadoop.version=2.3.0 
> {code}
>  =>
> {noformat}
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>   at 
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: scala.reflect.internal.Types$TypeError: bad symbolic reference. A 
> signature in TestSuiteBase.class refers to term dstream
> in package org.apache.spark.streaming which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling 
> TestSuiteBase.class.
> {noformat}






[jira] [Updated] (SPARK-3190) Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Description: 
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using an Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ;  numEdges=12163784626

2. Input : numNodes=2157586441 ; noEdges=2747322705
   Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph: numVertices=1725060105 ;  numEdges=2041768213

You can find the code to generate this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf

Note: Nodes are labeled 1...6B.

  was:
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using an Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ;  numEdges=12163784626

2. Input : numNodes=2157586441 ; noEdges=2747322705
   Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph: numVertices=1725060105 ;  numEdges=2041768213

You can find the code to generate this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf


> Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow 
> somewhere
> ---
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2 . Using latest code from 
> master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
>Reporter: npanj
>Priority: Critical
>
> While creating a graph with 6B nodes and 12B edges, I noticed that the 
> 'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
> number. A few times (with different datasets > 2.5B nodes) I have also 
> noticed that numVertices is returned as a negative number, so I suspect that 
> there is some overflow (maybe we are using an Int for some field?).
> Here are some details of the experiments I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>Graph returns: numVertices=1807028297 ;  numEdges=12163784626
> 2. Input : numNodes=2157586441 ; noEdges=2747322705
>Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>Graph: numVertices=1725060105 ;  numEdges=2041768213
> You can find the code to generate this bug here: 
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> Note: Nodes are labeled 1...6B.
>  






[jira] [Updated] (SPARK-3190) Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Environment: Standalone mode running on EC2 . Using latest code from master 
branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .  (was: 
Standalone mode running on EC2 )

> Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow 
> somewhere
> ---
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2 . Using latest code from 
> master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
>Reporter: npanj
>Priority: Critical
>
> While creating a graph with 6B nodes and 12B edges, I noticed that the 
> 'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
> number. A few times (with different datasets > 2.5B nodes) I have also 
> noticed that numVertices is returned as a negative number, so I suspect that 
> there is some overflow (maybe we are using an Int for some field?).
> Here are some details of the experiments I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>Graph returns: numVertices=1807028297 ;  numEdges=12163784626
> 2. Input : numNodes=2157586441 ; noEdges=2747322705
>Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>Graph: numVertices=1725060105 ;  numEdges=2041768213
> You can find the code to generate this bug here: 
> https://gist.github.com/npanj/92e949d86d08715bf4bf
>  






[jira] [Updated] (SPARK-3190) Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow somewhere

2014-08-22 Thread npanj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

npanj updated SPARK-3190:
-

Description: 
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using an Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ;  numEdges=12163784626

2. Input : numNodes=2157586441 ; noEdges=2747322705
   Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph: numVertices=1725060105 ;  numEdges=2041768213

You can find the code to generate this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf

  was:
While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using an Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ;  numEdges=12163784626

2. Input : numNodes=2157586441 ; noEdges=2747322705
   Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph: numVertices=1725060105 ;  numEdges=2041768213

You can find the code to generate this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf


> Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow 
> somewhere
> ---
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2 
>Reporter: npanj
>Priority: Critical
>
> While creating a graph with 6B nodes and 12B edges, I noticed that the 
> 'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
> number. A few times (with different datasets > 2.5B nodes) I have also 
> noticed that numVertices is returned as a negative number, so I suspect that 
> there is some overflow (maybe we are using an Int for some field?).
> Here are some details of the experiments I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>Graph returns: numVertices=1807028297 ;  numEdges=12163784626
> 2. Input : numNodes=2157586441 ; noEdges=2747322705
>Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>Graph: numVertices=1725060105 ;  numEdges=2041768213
> You can find the code to generate this bug here: 
> https://gist.github.com/npanj/92e949d86d08715bf4bf
>  






[jira] [Created] (SPARK-3190) Creation of large graph(over 2.5B nodes) seems to be broken:possible overflow somewhere

2014-08-22 Thread npanj (JIRA)
npanj created SPARK-3190:


 Summary: Creation of large graph(over 2.5B nodes) seems to be 
broken:possible overflow somewhere
 Key: SPARK-3190
 URL: https://issues.apache.org/jira/browse/SPARK-3190
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.3
 Environment: Standalone mode running on EC2 
Reporter: npanj
Priority: Critical


While creating a graph with 6B nodes and 12B edges, I noticed that the 
'numVertices' API returns an incorrect result; 'numEdges' reports the correct 
number. A few times (with different datasets > 2.5B nodes) I have also noticed 
that numVertices is returned as a negative number, so I suspect that there is 
some overflow (maybe we are using an Int for some field?).

Here are some details of the experiments I have done so far: 
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ;  numEdges=12163784626

2. Input : numNodes=2157586441 ; noEdges=2747322705
   Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705

3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph: numVertices=1725060105 ;  numEdges=2041768213

You can find the code to generate this bug here: 

https://gist.github.com/npanj/92e949d86d08715bf4bf



[jira] [Commented] (SPARK-3184) Allow user to specify num tasks to use for a table

2014-08-22 Thread Andy Konwinski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107748#comment-14107748
 ] 

Andy Konwinski commented on SPARK-3184:
---

[~marmbrus], did we figure out if this feature is in fact missing right now?

> Allow user to specify num tasks to use for a table
> --
>
> Key: SPARK-3184
> URL: https://issues.apache.org/jira/browse/SPARK-3184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Andy Konwinski
>







[jira] [Closed] (SPARK-3189) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang closed SPARK-3189.


Resolution: Duplicate

> Add Robust Regression Algorithm with Tukey bisquare weight function 
> (Biweight Estimates) 
> ---
>
> Key: SPARK-3189
> URL: https://issues.apache.org/jira/browse/SPARK-3189
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Fan Jiang
>Priority: Critical
>  Labels: features
> Fix For: 1.1.1, 1.2.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors have a normal distribution 
> and can behave badly when the errors are heavy-tailed. In practice we 
> encounter various types of data, so we need to include robust regression to 
> employ a fitting criterion that is not as vulnerable as least squares.
> The Tukey bisquare weight function, also referred to as the biweight 
> function, produces an M-estimator that is more resistant to regression 
> outliers than the Huber M-estimator (Andersen 2008: 19).






[jira] [Created] (SPARK-3189) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3189:


 Summary: Add Robust Regression Algorithm with Tukey bisquare 
weight function (Biweight Estimates) 
 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume the errors have a normal distribution and 
can behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include robust regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Tukey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).
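
For reference, the bisquare weight has a simple closed form; a minimal Scala
sketch (editor's illustration; the tuning constant k = 4.685 is the
conventional default and is not taken from this report):

{code}
// Tukey's bisquare (biweight) weight: w = (1 - (e/k)^2)^2 while |e/k| < 1,
// and 0 beyond that, so large residuals get zero weight. This is what makes
// the resulting M-estimator resistant to gross outliers.
def bisquareWeight(residual: Double, k: Double = 4.685): Double = {
  val u = residual / k
  if (math.abs(u) < 1.0) { val t = 1.0 - u * u; t * t } else 0.0
}
{code}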








[jira] [Created] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3188:


 Summary: Add Robust Regression Algorithm with Tukey bisquare 
weight function (Biweight Estimates) 
 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume the errors have a normal distribution and 
can behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include robust regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Tukey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).








[jira] [Commented] (SPARK-3102) Add tests for yarn-client mode

2014-08-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107719#comment-14107719
 ] 

Josh Rosen commented on SPARK-3102:
---

Good point; I'll convert this to a subtask of that JIRA.

> Add tests for yarn-client mode
> --
>
> Key: SPARK-3102
> URL: https://issues.apache.org/jira/browse/SPARK-3102
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>
> It looks like some of the {{yarn-client}} code paths aren't exercised by any 
> of the existing tests because my pull request was able to introduce a bug 
> that wasn't caught by Jenkins: 
> https://github.com/apache/spark/pull/2002#discussion-diff-16331781
> We should eventually add tests for this.






[jira] [Updated] (SPARK-3102) Add tests for yarn-client mode

2014-08-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3102:
--

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-2778

> Add tests for yarn-client mode
> --
>
> Key: SPARK-3102
> URL: https://issues.apache.org/jira/browse/SPARK-3102
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>
> It looks like some of the {{yarn-client}} code paths aren't exercised by any 
> of the existing tests because my pull request was able to introduce a bug 
> that wasn't caught by Jenkins: 
> https://github.com/apache/spark/pull/2002#discussion-diff-16331781
> We should eventually add tests for this.






[jira] [Resolved] (SPARK-1287) yarn alpha and stable Client calculateAMMemory routines are different

2014-08-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-1287.
--

Resolution: Duplicate

duplicate of SPARK-2140

> yarn alpha and stable Client calculateAMMemory routines are different
> -
>
> Key: SPARK-1287
> URL: https://issues.apache.org/jira/browse/SPARK-1287
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Thomas Graves
>
> The yarn alpha version of calculateAMMemory takes into account the minimum 
> resource capability and also subtracts out 
> YarnAllocationHandler.MEMORY_OVERHEAD.
> The yarn stable version just sets the -Xmx to whatever is passed in by the 
> user.
> These two should be the same. 
>
> Personally, I also think it's weird how Spark currently takes whatever the 
> user passes in for memory and adds YarnAllocationHandler.MEMORY_OVERHEAD to 
> the request to the RM. This can be confusing to users.  We should revisit all 
> of this and commonize the stable/alpha code where possible.
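
A rough sketch of the contrast described above (editor's simplification; the
real logic lives in the alpha/stable Client classes and the exact rounding
details may differ):

{code}
// yarn-alpha: round the request up to a multiple of the RM's minimum
// resource capability, then subtract MEMORY_OVERHEAD before setting -Xmx.
def calculateAMMemoryAlpha(requested: Int, minCapability: Int, overhead: Int): Int = {
  val rounded = ((requested + minCapability - 1) / minCapability) * minCapability
  rounded - overhead
}

// yarn-stable: use the user's value as -Xmx unchanged.
def calculateAMMemoryStable(requested: Int): Int = requested
{code}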






[jira] [Created] (SPARK-3187) Refactor and cleanup Yarn allocator code

2014-08-22 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-3187:
-

 Summary: Refactor and cleanup Yarn allocator code
 Key: SPARK-3187
 URL: https://issues.apache.org/jira/browse/SPARK-3187
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Marcelo Vanzin
Priority: Minor


This is a follow-up to SPARK-2933, which dealt with the ApplicationMaster code.

There's a lot of logic in the container allocation code in alpha/stable that 
could probably be merged.






[jira] [Closed] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho closed SPARK-3186.



> Enable parallelism for Reduce Side Join [Spark Branch] 
> ---
>
> Key: SPARK-3186
> URL: https://issues.apache.org/jira/browse/SPARK-3186
> Project: Spark
>  Issue Type: Bug
>Reporter: Szehon Ho
>







[jira] [Resolved] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho resolved SPARK-3186.
--

Resolution: Invalid

Sorry please ignore, meant to file this in Hive project.

> Enable parallelism for Reduce Side Join [Spark Branch] 
> ---
>
> Key: SPARK-3186
> URL: https://issues.apache.org/jira/browse/SPARK-3186
> Project: Spark
>  Issue Type: Bug
>Reporter: Szehon Ho
>
> Blocked by SPARK-2978.  See parent JIRA for design details.






[jira] [Updated] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szehon Ho updated SPARK-3186:
-

Description: (was: Blocked by SPARK-2978.  See parent JIRA for design 
details.)

> Enable parallelism for Reduce Side Join [Spark Branch] 
> ---
>
> Key: SPARK-3186
> URL: https://issues.apache.org/jira/browse/SPARK-3186
> Project: Spark
>  Issue Type: Bug
>Reporter: Szehon Ho
>







[jira] [Created] (SPARK-3186) Enable parallelism for Reduce Side Join [Spark Branch]

2014-08-22 Thread Szehon Ho (JIRA)
Szehon Ho created SPARK-3186:


 Summary: Enable parallelism for Reduce Side Join [Spark Branch] 
 Key: SPARK-3186
 URL: https://issues.apache.org/jira/browse/SPARK-3186
 Project: Spark
  Issue Type: Bug
Reporter: Szehon Ho


Blocked by SPARK-2978.  See parent JIRA for design details.






[jira] [Commented] (SPARK-2140) yarn stable client doesn't properly handle MEMORY_OVERHEAD for AM

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107499#comment-14107499
 ] 

Marcelo Vanzin commented on SPARK-2140:
---

Same as SPARK-1287?

> yarn stable client doesn't properly handle MEMORY_OVERHEAD for AM
> -
>
> Key: SPARK-2140
> URL: https://issues.apache.org/jira/browse/SPARK-2140
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
> Fix For: 1.0.1, 1.1.0
>
>
> The yarn stable client doesn't properly remove the MEMORY_OVERHEAD amount 
> from the java heap size, the code to handle that is commented out (see 
> function calculateAMMemory).  We should fix this.






[jira] [Commented] (SPARK-3090) Avoid not stopping SparkContext with YARN Client mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107488#comment-14107488
 ] 

Marcelo Vanzin commented on SPARK-3090:
---

I think that if we want to add this, it would be better to do so for all modes, 
not just yarn-client. Basically have SparkContext itself register a shutdown 
hook to shut itself down, and publish the priority of the hook so that apps / 
backends can register hooks that run before it (see Hadoop's 
ShutdownHookManager for the priority thing - http://goo.gl/BQ1bjk).

That way the code in the yarn-cluster backend can be removed too.
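
A minimal sketch of that suggestion, assuming Hadoop's ShutdownHookManager is
on the classpath (illustrative only, not Spark's actual code; the names below
are made up):

{code}
import org.apache.hadoop.util.ShutdownHookManager

object SparkContextShutdown {
  // Published so apps/backends can register hooks at a higher priority;
  // higher-priority hooks run earlier in Hadoop's ShutdownHookManager.
  val SPARK_CONTEXT_PRIORITY = 50

  def install(stopSparkContext: () => Unit): Unit = {
    ShutdownHookManager.get().addShutdownHook(new Runnable {
      override def run(): Unit = stopSparkContext()
    }, SPARK_CONTEXT_PRIORITY)
  }
}
{code}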

>  Avoid not stopping SparkContext with YARN Client mode
> --
>
> Key: SPARK-3090
> URL: https://issues.apache.org/jira/browse/SPARK-3090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> When we use YARN Cluster mode, the ApplicationMaster registers a shutdown 
> hook that stops the SparkContext.
> Thanks to this, the SparkContext can stop even if the application forgets to 
> stop it itself.
> But, unfortunately, YARN Client mode doesn't have such a mechanism.






[jira] [Commented] (SPARK-3099) Staging Directory is never deleted when we run job with YARN Client Mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107481#comment-14107481
 ] 

Marcelo Vanzin commented on SPARK-3099:
---

Pretty sure I covered this in the PR for SPARK-2933.

> Staging Directory is never deleted when we run job with YARN Client Mode
> 
>
> Key: SPARK-3099
> URL: https://issues.apache.org/jira/browse/SPARK-3099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> When we run an application in YARN Cluster mode, the class 
> 'ApplicationMaster' is used as the ApplicationMaster, which has a shutdown 
> hook to clean up the staging directory (~/.sparkStaging).
> But when we run an application in YARN Client mode, the class 
> 'ExecutorLauncher' used as the ApplicationMaster doesn't clean up the staging 
> directory.






[jira] [Commented] (SPARK-3102) Add tests for yarn-client mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107480#comment-14107480
 ] 

Marcelo Vanzin commented on SPARK-3102:
---

SPARK-2778?

> Add tests for yarn-client mode
> --
>
> Key: SPARK-3102
> URL: https://issues.apache.org/jira/browse/SPARK-3102
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>
> It looks like some of the {{yarn-client}} code paths aren't exercised by any 
> of the existing tests because my pull request was able to introduce a bug 
> that wasn't caught by Jenkins: 
> https://github.com/apache/spark/pull/2002#discussion-diff-16331781
> We should eventually add tests for this.






[jira] [Commented] (SPARK-3101) Missing volatile annotation in ApplicationMaster

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107478#comment-14107478
 ] 

Marcelo Vanzin commented on SPARK-3101:
---

I covered this in the PR for SPARK-2933 also.

> Missing volatile annotation in ApplicationMaster
> 
>
> Key: SPARK-3101
> URL: https://issues.apache.org/jira/browse/SPARK-3101
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
>
> In ApplicationMaster, the field 'isLastAMRetry' is used as a flag, but it's 
> not declared volatile even though it's accessed from multiple threads.
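
The fix being described is a one-line annotation; a sketch (the field name is
taken from the report):

{code}
// Without @volatile, a write to this flag on one thread is not guaranteed
// to become visible to other threads that read it.
@volatile private var isLastAMRetry: Boolean = true
{code}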






[jira] [Commented] (SPARK-3107) Don't pass null jar to executor in yarn-client mode

2014-08-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107476#comment-14107476
 ] 

Marcelo Vanzin commented on SPARK-3107:
---

The {{--jar}} issue is fixed in SPARK-2933.

Where do you see the sys properties issue? Setting a system property to an 
empty value is semantically different from not setting it, although I'm 
sceptical it would make a difference here.

> Don't pass null jar to executor in yarn-client mode
> ---
>
> Key: SPARK-3107
> URL: https://issues.apache.org/jira/browse/SPARK-3107
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> In the following line, ExecutorLauncher's `--jar` takes in null.
> {code}
> 14/08/18 20:52:43 INFO yarn.Client:   command: $JAVA_HOME/bin/java -server 
> -Xmx512m ... org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' 
> --jar null  --arg  'ip-172-31-0-12.us-west-2.compute.internal:56838' 
> --executor-memory 1024 --executor-cores 1 --num-executors  2
> {code}
> Also it appears that we set a bunch of system properties to empty strings 
> (not shown). We should avoid setting these if they don't actually contain 
> values.
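
For the system-property half of this, the kind of guard being suggested could
look like the following (hypothetical helper, not existing Spark code):

{code}
// Only set a system property when there is a real, non-empty value,
// instead of unconditionally setting it to "" or "null".
def setIfDefined(key: String, value: Option[String]): Unit =
  value.filter(_.nonEmpty).foreach(v => sys.props(key) = v)
{code}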






[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2014-08-22 Thread Jeremy Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107388#comment-14107388
 ] 

Jeremy Chambers commented on SPARK-3185:


Working on rebuilding client with Hadoop 2.

> SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
> JOURNAL_FOLDER
> ---
>
> Key: SPARK-3185
> URL: https://issues.apache.org/jira/browse/SPARK-3185
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2
> Environment: Amazon Linux AMI
> [ec2-user@ip-172-30-1-145 ~]$ uname -a
> Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
> The build I used (and MD5 verified):
> [ec2-user@ip-172-30-1-145 ~]$ wget 
> http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
>Reporter: Jeremy Chambers
>
> org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
> communicate with client version 4
> When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
> exception is thrown when "Formatting JOURNAL_FOLDER".
> No exception occurs when I launch on Hadoop 1.
> Launch used:
> ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
> --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
> sparkProd
> log snippet
> Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
> Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
> communicate with client version 4
> at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
> at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:73)
> at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
> at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
> at tachyon.Format.main(Format.java:54)
> Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
> communicate with client version 4
> at org.apache.hadoop.ipc.Client.call(Client.java:1070)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
> at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
> at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:69)
> ... 3 more
> Killed 0 processes
> Killed 0 processes
> ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
> ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
> ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
> ---end snippet---
> *** I don't have this problem when I launch without the 
> "--hadoop-major-version=2" (which defaults to Hadoop 1.x)






[jira] [Commented] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2014-08-22 Thread Jeremy Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107300#comment-14107300
 ] 

Jeremy Chambers commented on SPARK-3185:


Cross reference: 
http://apache-spark-user-list.1001560.n3.nabble.com/Server-IPC-version-7-cannot-communicate-with-client-version-4-with-Spark-Streaming-1-0-0-in-Java-ande-tp9908p9914.html
 

> SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting 
> JOURNAL_FOLDER
> ---
>
> Key: SPARK-3185
> URL: https://issues.apache.org/jira/browse/SPARK-3185
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2
> Environment: Amazon Linux AMI
> [ec2-user@ip-172-30-1-145 ~]$ uname -a
> Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
> UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/
> The build I used (and MD5 verified):
> [ec2-user@ip-172-30-1-145 ~]$ wget 
> http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz
>Reporter: Jeremy Chambers
>
> org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
> communicate with client version 4
> When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
> exception is thrown when "Formatting JOURNAL_FOLDER".
> No exception occurs when I launch on Hadoop 1.
> Launch used:
> ./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
> --zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
> sparkProd
> log snippet
> Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
> Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
> communicate with client version 4
> at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
> at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:73)
> at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
> at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
> at tachyon.Format.main(Format.java:54)
> Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
> communicate with client version 4
> at org.apache.hadoop.ipc.Client.call(Client.java:1070)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
> at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
> at 
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
> at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:69)
> ... 3 more
> Killed 0 processes
> Killed 0 processes
> ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
> ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
> ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
> ---end snippet---
> *** I don't have this problem when I launch without the 
> "--hadoop-major-version=2" (which defaults to Hadoop 1.x)






[jira] [Created] (SPARK-3185) SPARK launch on Hadoop 2 in EC2 throws Tachyon exception when Formatting JOURNAL_FOLDER

2014-08-22 Thread Jeremy Chambers (JIRA)
Jeremy Chambers created SPARK-3185:
--

 Summary: SPARK launch on Hadoop 2 in EC2 throws Tachyon exception 
when Formatting JOURNAL_FOLDER
 Key: SPARK-3185
 URL: https://issues.apache.org/jira/browse/SPARK-3185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Amazon Linux AMI

[ec2-user@ip-172-30-1-145 ~]$ uname -a
Linux ip-172-30-1-145 3.10.42-52.145.amzn1.x86_64 #1 SMP Tue Jun 10 23:46:43 
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
https://aws.amazon.com/amazon-linux-ami/2014.03-release-notes/

The build I used (and MD5 verified):
[ec2-user@ip-172-30-1-145 ~]$ wget 
http://supergsego.com/apache/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz


Reporter: Jeremy Chambers


org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4

When I launch SPARK 1.0.2 on Hadoop 2 in a new EC2 cluster, the above tachyon 
exception is thrown when "Formatting JOURNAL_FOLDER".

No exception occurs when I launch on Hadoop 1.

Launch used:
./spark-ec2 -k spark_cluster -i /home/ec2-user/kagi/spark_cluster.ppk 
--zone=us-east-1a --hadoop-major-version=2 --spot-price=0.0165 -s 3 launch 
sparkProd

log snippet
Formatting Tachyon Master @ ec2-54-80-49-244.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread "main" java.lang.RuntimeException: 
org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate 
with client version 4
at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:73)
at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot 
communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at 
org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:69)
... 3 more
Killed 0 processes
Killed 0 processes
ec2-54-167-219-159.compute-1.amazonaws.com: Killed 0 processes
ec2-54-198-198-17.compute-1.amazonaws.com: Killed 0 processes
ec2-54-166-36-0.compute-1.amazonaws.com: Killed 0 processes
---end snippet---

*** I don't have this problem when I launch without the 
"--hadoop-major-version=2" flag (which defaults to Hadoop 1.x).




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3176) Implement 'POWER', 'ABS' and 'LAST' for SQL

2014-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107237#comment-14107237
 ] 

Apache Spark commented on SPARK-3176:
-

User 'xinyunh' has created a pull request for this issue:
https://github.com/apache/spark/pull/2099

> Implement 'POWER', 'ABS' and 'LAST' for SQL
> --
>
> Key: SPARK-3176
> URL: https://issues.apache.org/jira/browse/SPARK-3176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.0
> Environment: All
>Reporter: Xinyun Huang
>Priority: Minor
> Fix For: 1.2.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Add support for the mathematical functions "POWER" and "ABS", and the analytic 
> function "LAST", which returns a subset of the rows satisfying a query, within 
> Spark SQL.
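
A hypothetical usage sketch, assuming the functions follow standard SQL 
semantics (the table and column names below are illustrative, not taken from 
the PR):

// Scalar functions applied per row:
sqlContext.sql("SELECT ABS(delta), POWER(base, 2) FROM records")
// LAST as an aggregate, returning a value from the last row of each group:
sqlContext.sql("SELECT key, LAST(score) FROM records GROUP BY key")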



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107233#comment-14107233
 ] 

Erik Erlandson commented on SPARK-2360:
---

It appears that this is not a purely lazy transform, as it invokes `first()` when 
inferring the schema from headers.
I wrote up some ideas on this, pertaining to SPARK-2315, here:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/
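
A minimal illustration of the laziness point, assuming schema inference reads 
the header via first() (the file name is illustrative):

val lines = sc.textFile("data.csv")  // lazy: no job has run yet
val header = lines.first()           // first() is an action, so a job runs here,
                                     // before the user ever asks for results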


> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hossein Falaki
>
> I think the first step is to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107155#comment-14107155
 ] 

Hossein Falaki commented on SPARK-2360:
---

There is a pull request for this issue: 


It did not make it into Spark 1.1 due to last-minute API changes; it will make 
it into the next release. The API will provide a very easy (default) way of 
reading common CSV (e.g., comma-delimited) files into SchemaRDDs. Users will be 
able to specify the delimiter and quotation characters.
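
A hypothetical sketch of what such an API could look like (the method and 
parameter names here are illustrative only; the actual interface was still 
being finalized at the time):

val people = sqlContext.csvFile("people.csv",
  delimiter = ",",  // field separator
  quote = '"')      // quotation character
people.registerTempTable("people")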

> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hossein Falaki
>
> I think the first step is to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2921) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)

2014-08-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2921:
---

Priority: Blocker  (was: Critical)

> Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other 
> things)
> ---
>
> Key: SPARK-2921
> URL: https://issues.apache.org/jira/browse/SPARK-2921
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> The code path to handle this exists only for the coarse-grained mode, and 
> even in this mode the Java options aren't passed to the executors properly. 
> We currently pass the entire value of spark.executor.extraJavaOptions to the 
> executors as a string without splitting it. We need to use 
> Utils.splitCommandString as in standalone mode.
> I have not confirmed this, but I would assume spark.executor.extraClassPath 
> and spark.executor.extraLibraryPath are also not propagated correctly in 
> either mode.
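
A sketch of the fix the description calls for, under the assumption that the 
Mesos backend collects the executor's JVM options from the config (variable 
names are illustrative):

import org.apache.spark.util.Utils

val extraOpts = conf.get("spark.executor.extraJavaOptions", "")
// Broken: the whole value is treated as one token, e.g. "-Xmx2g -XX:+UseG1GC"
//   javaOpts += extraOpts
// Fixed: split on whitespace while respecting quoting, as standalone mode does
val javaOpts = Utils.splitCommandString(extraOpts)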



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-08-22 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107131#comment-14107131
 ] 

Jonathan Kelly commented on SPARK-1981:
---

The code here is a cleaned-up version of the code from that article; Chris 
took it over from Parviz and integrated it into Spark itself, and it will be 
available when Spark 1.1 is released.
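
For reference, a minimal sketch of the receiver usage as merged for 1.1 
(requires the spark-streaming-kinesis-asl artifact; the stream name and 
endpoint below are placeholders):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = KinesisUtils.createStream(
  ssc, "myKinesisStream", "https://kinesis.us-east-1.amazonaws.com",
  Seconds(2), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)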

> Add AWS Kinesis streaming support
> -
>
> Key: SPARK-1981
> URL: https://issues.apache.org/jira/browse/SPARK-1981
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Chris Fregly
>Assignee: Chris Fregly
> Fix For: 1.1.0
>
>
> Add AWS Kinesis support to Spark Streaming.
> Initial discussion occured here:  https://github.com/apache/spark/pull/223
> I discussed this with Parviz from AWS recently and we agreed that I would 
> take this over.
> Look for a new PR that takes into account all the feedback from the earlier 
> PR including spark-1.0-compliant implementation, AWS-license-aware build 
> support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3184) Allow user to specify num tasks to use for a table

2014-08-22 Thread Andy Konwinski (JIRA)
Andy Konwinski created SPARK-3184:
-

 Summary: Allow user to specify num tasks to use for a table
 Key: SPARK-3184
 URL: https://issues.apache.org/jira/browse/SPARK-3184
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Andy Konwinski






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1981) Add AWS Kinesis streaming support

2014-08-22 Thread abhinav bondalapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107085#comment-14107085
 ] 

abhinav bondalapati commented on SPARK-1981:


Reading this article: 
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
Does EMR already have a Kinesis connector that I can use for Spark Streaming?

> Add AWS Kinesis streaming support
> -
>
> Key: SPARK-1981
> URL: https://issues.apache.org/jira/browse/SPARK-1981
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Chris Fregly
>Assignee: Chris Fregly
> Fix For: 1.1.0
>
>
> Add AWS Kinesis support to Spark Streaming.
> Initial discussion occured here:  https://github.com/apache/spark/pull/223
> I discussed this with Parviz from AWS recently and we agreed that I would 
> take this over.
> Look for a new PR that takes into account all the feedback from the earlier 
> PR including spark-1.0-compliant implementation, AWS-license-aware build 
> support, tests, comments, and style guide compliance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3183) Add option for requesting full YARN cluster

2014-08-22 Thread Sandy Ryza (JIRA)
Sandy Ryza created SPARK-3183:
-

 Summary: Add option for requesting full YARN cluster
 Key: SPARK-3183
 URL: https://issues.apache.org/jira/browse/SPARK-3183
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Sandy Ryza


This could possibly be in the form of --executor-cores ALL --executor-memory 
ALL --num-executors ALL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107048#comment-14107048
 ] 

Evan Chan commented on SPARK-2360:
--

Vineet,

You can have a look at my repo:
http://www.github.com/velvia/spark-sql-gdelt

See the CsvImporter class and README; it can import tab-delimited CSV 
styles (the GDELT format) and, with little modification, other CSV formats. 
It's not terribly efficient because it uses jsonRDD; I plan to make a more 
efficient version if I have time.
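
A minimal sketch of the jsonRDD-based approach described above (the column 
names and file path are illustrative; real input would also need escaping, 
which is part of why this approach is inefficient):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val header = Seq("date", "actor", "country")   // illustrative column names
val json = sc.textFile("gdelt.tsv").map { line =>
  header.zip(line.split("\t"))
        .map { case (k, v) => s""""$k": "$v"""" }
        .mkString("{", ", ", "}")              // build a JSON object per line
}
val table = sqlContext.jsonRDD(json)           // schema inferred from the JSON strings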




> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hossein Falaki
>
> I think the first step is to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2742) The variable inputFormatInfo and inputFormatMap never used

2014-08-22 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106997#comment-14106997
 ] 

Thomas Graves commented on SPARK-2742:
--

https://github.com/apache/spark/pull/1614

> The variable inputFormatInfo and inputFormatMap never used
> --
>
> Key: SPARK-2742
> URL: https://issues.apache.org/jira/browse/SPARK-2742
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: meiyoula
>Priority: Minor
> Fix For: 1.2.0
>
>
> The ClientArguments class has two unused variables: one is 
> inputFormatInfo, the other is inputFormatMap.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2742) The variable inputFormatInfo and inputFormatMap never used

2014-08-22 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2742.
--

   Resolution: Fixed
Fix Version/s: 1.2.0

> The variable inputFormatInfo and inputFormatMap never used
> --
>
> Key: SPARK-2742
> URL: https://issues.apache.org/jira/browse/SPARK-2742
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: meiyoula
>Priority: Minor
> Fix For: 1.2.0
>
>
> The ClientArguments class has two unused variables: one is 
> inputFormatInfo, the other is inputFormatMap.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs

2014-08-22 Thread Hingorani, Vineet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106851#comment-14106851
 ] 

Hingorani, Vineet commented on SPARK-2360:
--

Hello Michael,

I saw your comment thread in a mail archive about being able to 
manipulate CSV files using Spark. Could you please let me know whether this 
functionality is available in the latest release of Spark? I have installed the 
latest version and am running it on my local machine.

Thank you

Regards,

Vineet Hingorani
Developer Associate
Custom Development & Strategic Projects group (CD&SP)
Products & Innovation (P&I)
SAP SE
WDF 03, C3.03
E vineet.hingor...@sap.com



> CSV import to SchemaRDDs
> 
>
> Key: SPARK-2360
> URL: https://issues.apache.org/jira/browse/SPARK-2360
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hossein Falaki
>
> I think the first step is to design the interface that we want to present to 
> users.  Mostly this is defining options when importing.  Off the top of my 
> head:
> - What is the separator?
> - Provide column names or infer them from the first row.
> - how to handle multiple files with possibly different schemas
> - do we have a method to let users specify the datatypes of the columns or 
> are they just strings?
> - what types of quoting / escaping do we want to support?



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3182) Twitter Streaming Geolocation Filter

2014-08-22 Thread Daniel Kershaw (JIRA)
Daniel Kershaw created SPARK-3182:
-

 Summary: Twitter Streaming Geolocation Filter
 Key: SPARK-3182
 URL: https://issues.apache.org/jira/browse/SPARK-3182
 Project: Spark
  Issue Type: Wish
  Components: Streaming
Affects Versions: 1.0.2, 1.0.0
Reporter: Daniel Kershaw
 Fix For: 1.2.0


Add a geolocation filter to the Twitter Streaming Component. 

This should take a sequence of doubles indicating the bounding box for the 
stream.
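
A hypothetical API shape for this wish (no such parameter exists in 
TwitterUtils as of this writing; the coordinate order shown follows Twitter's 
locations convention of south-west longitude/latitude followed by north-east 
longitude/latitude):

import org.apache.spark.streaming.twitter.TwitterUtils

// Hypothetical overload; Seq(swLon, swLat, neLon, neLat) bounds the stream:
val tweets = TwitterUtils.createStream(ssc, None,
  locations = Seq(-122.75, 36.8, -121.75, 37.8))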



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-08-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106679#comment-14106679
 ] 

Apache Spark commented on SPARK-3181:
-

User 'fjiang6' has created a pull request for this issue:
https://github.com/apache/spark/pull/2096

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Fan Jiang
>Priority: Critical
>  Labels: features
> Fix For: 1.1.1, 1.2.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors are normally distributed and 
> can behave badly when the errors are heavy-tailed. In practice we encounter 
> various types of data, so we need to include robust regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression, which stands for 
> "maximum likelihood type" estimation. The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain three new files:
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3181:


 Summary: Add Robust Regression Algorithm with Huber Estimator
 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume the errors are normally distributed and can 
behave badly when the errors are heavy-tailed. In practice we encounter various 
types of data, so we need to include robust regression to employ a fitting 
criterion that is not as vulnerable as least squares.

In 1973, Huber introduced M-estimation for regression, which stands for "maximum 
likelihood type" estimation. The method is resistant to outliers in the response 
variable and has been widely used.

The new feature for MLlib will contain three new files:
/main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
/test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
/main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala

and one new class HuberRobustGradient in 
/main/scala/org/apache/spark/mllib/optimization/Gradient.scala
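
For reference, a minimal sketch of the Huber loss such a gradient would 
implement (delta is the tuning constant, with 1.345 a common default; this is 
illustrative, not the PR's code):

def huberLoss(residual: Double, delta: Double = 1.345): Double =
  if (math.abs(residual) <= delta) 0.5 * residual * residual   // quadratic near zero
  else delta * (math.abs(residual) - 0.5 * delta)              // linear in the tails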




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org