[jira] [Commented] (SPARK-4854) Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-16 Thread Shenghua Wan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247938#comment-14247938
 ] 

Shenghua Wan commented on SPARK-4854:
-

I was trying to debug the Spark source code to fix this issue, and I found that 
function name resolution might not work correctly: in the normal case 
(SELECT + custom UDTF) the function name is resolved to the fully qualified 
class name, but in the SELECT + LATERAL VIEW + custom UDTF case the function 
name is resolved only to the alias given in the CREATE TEMPORARY FUNCTION 
clause rather than to the fully qualified class name. In other words, the 
function-name-to-class-name translation fails in that situation.

I discovered a workaround trick while debugging the Spark source code.

The trick is to remove the package declaration (e.g. org.xxx) from the Java 
code of your custom UDTF, so that the class name equals its fully qualified 
name, and then to use that class name as the function alias in the CREATE 
TEMPORARY FUNCTION clause. In that case, even though Spark uses the unresolved 
alias as the function name, the name can still be resolved in the default 
namespace.
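
For illustration, here is a minimal sketch of the workaround (the jar path, the 
UDTF class name MyExplode, and the table name logs are placeholders I made up, 
not details from this report), e.g. from spark-shell with a HiveContext:

{code}
// Hedged sketch of the workaround described above; names are illustrative only.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: the SparkContext from spark-shell
hiveContext.sql("ADD JAR /tmp/my-udtf.jar")

// The UDTF class is compiled without a package declaration, so its simple name is
// also its fully qualified name, and the alias is chosen to be exactly that name.
hiveContext.sql("CREATE TEMPORARY FUNCTION MyExplode AS 'MyExplode'")

// Plain SELECT + custom UDTF already resolved the class:
hiveContext.sql("SELECT MyExplode(map_col) FROM logs").collect()

// SELECT + LATERAL VIEW + custom UDTF, which previously threw ClassNotFoundException,
// now works because the unresolved alias is itself a loadable class name:
hiveContext.sql("SELECT k, v FROM logs LATERAL VIEW MyExplode(map_col) t AS k, v").collect()
{code}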

 Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI
 -

 Key: SPARK-4854
 URL: https://issues.apache.org/jira/browse/SPARK-4854
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0, 1.1.1
Reporter: Shenghua Wan

 Hello, 
 I met a problem when using the Spark SQL CLI. A custom UDTF with lateral view 
 throws a ClassNotFound exception. I did a couple of experiments in the same 
 environment (Spark versions 1.1.0, 1.1.1): 
 select + the same custom UDTF (passed) 
 select + lateral view + custom UDTF (ClassNotFoundException) 
 select + lateral view + built-in UDTF (passed) 
 I have done some googling these days and found one related Spark issue ticket, 
 https://issues.apache.org/jira/browse/SPARK-4811
 which is about custom UDTFs not working in Spark SQL. 
 It would be helpful to put actual code here to reproduce the problem; however, 
 corporate regulations might prohibit this, so sorry about that. Directly 
 using explode's source code in a jar will reproduce it anyway. 
 Here is a portion of the stack trace from the exception, just in case: 
 java.lang.ClassNotFoundException: XXX 
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) 
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355) 
 at java.security.AccessController.doPrivileged(Native Method) 
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354) 
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425) 
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358) 
 at org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:81) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.createFunction(hiveUdfs.scala:247) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:254) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:254) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors$lzycompute(hiveUdfs.scala:261) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors(hiveUdfs.scala:260) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:265) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:265) 
 at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:269) 
 at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60) 
 at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50) 
 at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50) 
 at scala.Option.map(Option.scala:145) 
 at org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:50) 
 at org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:60) 
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79) 
 at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79) 
 at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) 
 at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) 
 at scala.collection.immutable.List.foreach(List.scala:318) 
 at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) 
 at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) 
 the rest is omitted. 



[jira] [Commented] (SPARK-4857) Add Executor Events to SparkListener

2014-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14247955#comment-14247955
 ] 

Apache Spark commented on SPARK-4857:
-

User 'ksakellis' has created a pull request for this issue:
https://github.com/apache/spark/pull/3711

 Add Executor Events to SparkListener
 

 Key: SPARK-4857
 URL: https://issues.apache.org/jira/browse/SPARK-4857
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis

 We need to add events to the SparkListener to indicate an executor has been 
 added or removed with corresponding information. 






[jira] [Updated] (SPARK-4341) Spark need to set num-executors automatically

2014-12-16 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated SPARK-4341:
-
Attachment: SPARK-4341.diff

 Spark need to set num-executors automatically
 -

 Key: SPARK-4341
 URL: https://issues.apache.org/jira/browse/SPARK-4341
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Hong Shen
 Attachments: SPARK-4341.diff


 A MapReduce job can set the number of map tasks automatically, but in Spark we 
 have to set num-executors, executor memory, and cores. It's difficult for users 
 to set these args, especially for users who want to use Spark SQL. So when the 
 user hasn't set num-executors, Spark should set num-executors automatically 
 according to the input partitions.
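
 For context, a rough sketch of the kind of heuristic this asks for (this is not 
 the attached SPARK-4341.diff; the helper, the cap, and the use of the 
 spark.executor.instances / spark.executor.cores keys are assumptions for 
 illustration):

{code}
// Hypothetical sketch: derive a default executor count from the input partition count
// when the user has not configured one explicitly.
import org.apache.spark.SparkConf

def defaultNumExecutors(conf: SparkConf, inputPartitions: Int): Int = {
  conf.getOption("spark.executor.instances").map(_.toInt).getOrElse {
    val coresPerExecutor = conf.getInt("spark.executor.cores", 1)
    // Roughly one core per input partition, capped to keep the resource request sane.
    math.max(1, math.min(inputPartitions / coresPerExecutor, 200))
  }
}
{code}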






[jira] [Resolved] (SPARK-2988) Port repl to scala 2.11.

2014-12-16 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-2988.

Resolution: Fixed

 Port repl to scala 2.11.
 

 Key: SPARK-2988
 URL: https://issues.apache.org/jira/browse/SPARK-2988
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Prashant Sharma
Assignee: Prashant Sharma








[jira] [Resolved] (SPARK-2987) Adjust build system to support building with scala 2.11 and fix tests.

2014-12-16 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-2987.

Resolution: Fixed

 Adjust build system to support building with scala 2.11 and fix tests.
 --

 Key: SPARK-2987
 URL: https://issues.apache.org/jira/browse/SPARK-2987
 Project: Spark
  Issue Type: Sub-task
  Components: Build, Spark Core
Reporter: Prashant Sharma
Assignee: Prashant Sharma








[jira] [Resolved] (SPARK-2709) Add a tool for certifying Spark API compatiblity

2014-12-16 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-2709.

Resolution: Fixed

 Add a tool for certifying Spark API compatiblity
 

 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 As Spark is packaged by more and more distributors, it would be good to have 
 a tool that verifies the API compatibility of a provided Spark package. The 
 tool would certify that a vendor distribution of Spark contains all of the 
 APIs present in a particular upstream Spark version.
 This will help vendors make sure they remain API compliant when they make 
 changes or backports to Spark. It will also discourage vendors from 
 knowingly breaking APIs, because anyone can audit their distribution and see 
 that they have removed support for certain APIs.
 I'm hoping a tool like this will avoid API fragmentation in the Spark 
 community.
 One poor man's implementation of this is that a vendor can just run the 
 binary compatibility checks in the Spark build against an upstream version of 
 Spark. That's a pretty good start, but it means you can't come in as a third 
 party and audit a distribution.
 Another approach would be to have something where anyone can come in and 
 audit a distribution even if they don't have access to the packaging and 
 source code. That would look something like this:
 1. For each release we publish a manifest of all public APIs (we might 
 borrow the MiMa string representation of bytecode signatures)
 2. We package an auditing tool as a jar file.
 3. The user runs a tool with spark-submit that reflectively walks through all 
 exposed Spark APIs and makes sure that everything on the manifest is 
 encountered.
 From the implementation side, this is just brainstorming at this point.
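
 For illustration, a minimal sketch of step 3 above, reflectively listing a 
 class's public method signatures and diffing them against a manifest (the 
 manifest format, the file argument, and the signature encoding are assumptions, 
 not the MiMa representation):

{code}
// Hedged sketch, not an actual auditing tool.
object ApiAuditSketch {
  // Public method signatures of one class, encoded as simple strings.
  def publicSignatures(className: String): Set[String] = {
    val cls = Class.forName(className)
    cls.getMethods.map { m =>
      s"${cls.getName}.${m.getName}(${m.getParameterTypes.map(_.getName).mkString(",")})"
    }.toSet
  }

  def main(args: Array[String]): Unit = {
    // args(0): a hypothetical manifest file with one expected signature per line.
    val manifest = scala.io.Source.fromFile(args(0)).getLines().toSet
    val found = publicSignatures("org.apache.spark.SparkContext")
    val missing = manifest.filter(_.startsWith("org.apache.spark.SparkContext")) -- found
    missing.foreach(sig => println(s"MISSING: $sig"))
  }
}
{code}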






[jira] [Resolved] (SPARK-1338) Create Additional Style Rules

2014-12-16 Thread Prashant Sharma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma resolved SPARK-1338.

Resolution: Fixed

This is fixed by updating the scalastyle version. Applying these styles across 
the code base can be a different issue.

 Create Additional Style Rules
 -

 Key: SPARK-1338
 URL: https://issues.apache.org/jira/browse/SPARK-1338
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Prashant Sharma
 Fix For: 1.2.0


 There are a few other rules that would be helpful to have. We should also add 
 tests for these rules because it's easy to get them wrong. I gave some 
 example comparisons from a JavaScript style checker.
 Require spaces in type declarations:
 def foo:String = "X" // no
 def foo: String = "XXX"
 def x:Int = 100 // no
 val x: Int = 100
 Require spaces after keywords:
 if(x - 3) // no
 if (x + 10)
 See: requireSpaceAfterKeywords from
 https://github.com/mdevils/node-jscs
 Disallow spaces inside of parentheses:
 val x = ( 3 + 5 ) // no
 val x = (3 + 5)
 See: disallowSpacesInsideParentheses from
 https://github.com/mdevils/node-jscs
 Require spaces before and after binary operators:
 See: requireSpaceBeforeBinaryOperators
 See: disallowSpaceAfterBinaryOperators
 from https://github.com/mdevils/node-jscs






[jira] [Commented] (SPARK-2709) Add a tool for certifying Spark API compatiblity

2014-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248174#comment-14248174
 ] 

Sean Owen commented on SPARK-2709:
--

Was this sort of tool implemented?

 Add a tool for certifying Spark API compatiblity
 

 Key: SPARK-2709
 URL: https://issues.apache.org/jira/browse/SPARK-2709
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Prashant Sharma

 As Spark is packaged by more and more distributors, it would be good to have 
 a tool that verifies the API compatibility of a provided Spark package. The 
 tool would certify that a vendor distribution of Spark contains all of the 
 APIs present in a particular upstream Spark version.
 This will help vendors make sure they remain API compliant when they make 
 changes or backports to Spark. It will also discourage vendors from 
 knowingly breaking APIs, because anyone can audit their distribution and see 
 that they have removed support for certain APIs.
 I'm hoping a tool like this will avoid API fragmentation in the Spark 
 community.
 One poor man's implementation of this is that a vendor can just run the 
 binary compatibility checks in the Spark build against an upstream version of 
 Spark. That's a pretty good start, but it means you can't come in as a third 
 party and audit a distribution.
 Another approach would be to have something where anyone can come in and 
 audit a distribution even if they don't have access to the packaging and 
 source code. That would look something like this:
 1. For each release we publish a manifest of all public APIs (we might 
 borrow the MiMa string representation of bytecode signatures)
 2. We package an auditing tool as a jar file.
 3. The user runs a tool with spark-submit that reflectively walks through all 
 exposed Spark APIs and makes sure that everything on the manifest is 
 encountered.
 From the implementation side, this is just brainstorming at this point.






[jira] [Commented] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl

2014-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248176#comment-14248176
 ] 

Sean Owen commented on SPARK-3955:
--

[~jongyoul] That PR was never merged though: 
https://github.com/apache/spark/pull/3379
You closed yours too: https://github.com/apache/spark/pull/2818
I don't think this is resolved.

 Different versions between jackson-mapper-asl and jackson-core-asl
 --

 Key: SPARK-3955
 URL: https://issues.apache.org/jira/browse/SPARK-3955
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Jongyoul Lee

 The parent pom.xml specifies a version of jackson-mapper-asl, which is used 
 by sql/hive/pom.xml. When mvn assembly runs, however, the jackson-mapper-asl 
 version is not the same as the jackson-core-asl version. This is because 
 other libraries use several versions of Jackson, so a different version of 
 jackson-core-asl is assembled. Simply put, this problem would be fixed if 
 pom.xml specified an explicit version of jackson-core-asl. If it is not set, 
 version 1.9.11 is merged into assembly.jar and we cannot use the Jackson 
 library properly.
 {code}
 [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the 
 shaded jar.
 [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the 
 shaded jar.
 {code}






[jira] [Commented] (SPARK-4777) Some block memory after unrollSafely not count into used memory(memoryStore.entrys or unrollMemory)

2014-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248177#comment-14248177
 ] 

Sean Owen commented on SPARK-4777:
--

[~SuYan] Given your comment in your PR, do you intend this to be considered a 
duplicate of SPARK-3000?

 Some block memory after unrollSafely not count into used 
 memory(memoryStore.entrys or unrollMemory)
 ---

 Key: SPARK-4777
 URL: https://issues.apache.org/jira/browse/SPARK-4777
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: SuYan
Priority: Minor

 Some memory is not counted into the memory used by memoryStore or unrollMemory.
 After Thread A unrolls a block via unrollSafely, it releases 40MB of 
 unrollMemory (which can then be used by other threads). Thread A then waits to 
 acquire accountingLock in order to tryToPut block A (30MB). Before Thread A 
 acquires accountingLock, block A's memory size is not counted into 
 unrollMemory or memoryStore.currentMemory.
   
  IIUC, freeMemory should subtract that block's memory.






[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring

2014-12-16 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{quote}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{quote}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.

  was:
We have data like:
{code:java}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{code:xml}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.


 Null  empty string should not be considered as StringType at begining in 
 Json schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {quote}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {quote}
 As the empty string ("headers") is considered a String at the beginning (in 
 lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" in line 1 as StringType, 
 which is not what we expected.
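
 For context, a minimal sketch of how this inference is exercised, using the 
 data above (the schema comments reflect my reading of the report; jsonRDD is 
 the SQLContext API of that era):

{code}
// Hedged sketch: reproduce the inference described above.
import org.apache.spark.sql.test.TestSQLContext

val json = TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)

val schemaRdd = TestSQLContext.jsonRDD(json)
schemaRdd.printSchema()
// Reported: "headers" is inferred as StringType because of the empty string.
// Expected: "headers" should keep the struct type from the first record.
{code}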






[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring

2014-12-16 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.

  was:
We have data like:
{quote}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{quote}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.


 Null  empty string should not be considered as StringType at begining in 
 Json schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {panel}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {panel}
 As the empty string ("headers") is considered a String at the beginning (in 
 lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" in line 1 as StringType, 
 which is not what we expected.






[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring

2014-12-16 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.

  was:
dWe have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.


 Null  empty string should not be considered as StringType at begining in 
 Json schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {noformat}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {noformat}
 As the empty string ("headers") is considered a String at the beginning (in 
 lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" in line 1 as StringType, 
 which is not what we expected.






[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring

2014-12-16 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
dWe have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.

  was:
We have data like:
{panel}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{panel}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.


 Null  empty string should not be considered as StringType at begining in 
 Json schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 dWe have data like:
 {noformat}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {noformat}
 As the empty string ("headers") is considered a String at the beginning (in 
 lines 2 and 3), it ignores the real nested data type (the struct type 
 "headers" in line 1) and also takes the "headers" in line 1 as StringType, 
 which is not what we expected.






[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring

2014-12-16 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String, it ignores the real 
nested data type (the struct type "headers" in line 1), and then we get 
"headers" as StringType, which is not our expectation.

  was:
We have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String at the beginning (in 
lines 2 and 3), it ignores the real nested data type (the struct type 
"headers" in line 1) and also takes the "headers" in line 1 as StringType, 
which is not what we expected.


 Null  empty string should not be considered as StringType at begining in 
 Json schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {noformat}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {noformat}
 As the empty string ("headers") is considered a String, it ignores the real 
 nested data type (the struct type "headers" in line 1), and then we get 
 "headers" as StringType, which is not our expectation.






[jira] [Updated] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring

2014-12-16 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-4856:
-
Description: 
We have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String, it ignores the real 
nested data type (the struct type "headers" in line 1), and then we get the 
"headers" (in line 1) as StringType, which is not our expectation.

  was:
We have data like:
{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string ("headers") is considered a String, it ignores the real 
nested data type (the struct type "headers" in line 1), and then we get 
"headers" as StringType, which is not our expectation.


 Null  empty string should not be considered as StringType at begining in 
 Json schema inferring
 ---

 Key: SPARK-4856
 URL: https://issues.apache.org/jira/browse/SPARK-4856
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao

 We have data like:
 {noformat}
 TestSQLContext.sparkContext.parallelize(
   """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
   """{"ip":"27.31.100.29","headers":{}}""" ::
   """{"ip":"27.31.100.29","headers":""}""" :: Nil)
 {noformat}
 As the empty string ("headers") is considered a String, it ignores the real 
 nested data type (the struct type "headers" in line 1), and then we get the 
 "headers" (in line 1) as StringType, which is not our expectation.






[jira] [Commented] (SPARK-4732) All application progress on the standalone scheduler can be halted by one systematically faulty node

2014-12-16 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248367#comment-14248367
 ] 

Harry Brundage commented on SPARK-4732:
---

Seems like it would, feel free to mark as duplicate!

 All application progress on the standalone scheduler can be halted by one 
 systematically faulty node
 

 Key: SPARK-4732
 URL: https://issues.apache.org/jira/browse/SPARK-4732
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
 Environment:  - Spark Standalone scheduler
Reporter: Harry Brundage

 We've experienced several cluster wide outages caused by unexpected system 
 wide faults on one of our spark workers if that worker is failing 
 systematically. By systematically, I mean that every executor launched by 
 that worker will definitely fail due to some reason out of Spark's control 
 like the log directory disk being completely out of space, or a permissions 
 error for a file that's always read during executor launch. We screw up all 
 the time on our team and cause stuff like this to happen, but because of the 
 way the standalone scheduler allocates resources, our cluster doesn't recover 
 gracefully from these failures. 
 When there are more tasks to do than executors, I am pretty sure the way the 
 scheduler works is that it just waits for more resource offers and then 
 allocates tasks from the queue to those resources. If an executor dies 
 immediately after starting, the worker monitor process will notice that it's 
 dead. The master will allocate that worker's now free cores/memory to a 
 currently running application that is below its spark.cores.max, which in our 
 case I've observed as usually the app that just had the executor die. A new 
 executor gets spawned on the same worker that the last one just died on, gets 
 allocated that one task that failed, and then the whole process fails again 
 for the same systematic reason, and lather rinse repeat. This happens 10 
 times or whatever the max task failure count is, and then the whole app is 
 deemed a failure by the driver and shut down completely.
 This happens to us for all applications in the cluster as well. We usually 
 run roughly as many cores as we have Hadoop nodes. We also usually have many 
 more input splits than we have tasks, which means the locality of the first 
 few tasks, which I believe determines where our executors run, is well spread 
 out over the cluster, and often covers 90-100% of nodes. This means the 
 likelihood of any application getting an executor scheduled on any broken node 
 is quite high. After an old application goes through the above-mentioned 
 process and dies, the next application to start, or not be at its requested 
 max capacity, gets an executor scheduled on the broken node, and is promptly 
 taken down as well. This happens over and over, to the point where 
 none of our Spark jobs are making any progress because of one tiny 
 permissions mistake on one node.
 Now, I totally understand this is usually an error between keyboard and 
 screen kind of situation where it is the responsibility of the people 
 deploying spark to ensure it is deployed correctly. The systematic issues 
 we've encountered are almost always of this nature: permissions errors, disk 
 full errors, one node not getting a new spark jar from a configuration error, 
 configurations being out of sync, etc. That said, disks are going to fail or 
 half fail, fill up, node rot is going to ruin configurations, etc etc etc, 
 and as Hadoop clusters scale in size this becomes more and more likely, so I 
 think it's reasonable to ask that Spark be resilient to this kind of failure 
 and keep on truckin'. 
 I think a good simple fix would be to have applications, or the master, 
 blacklist workers (not executors) at a failure count lower than the task 
 failure count. This would also serve as a belt and suspenders fix for 
 SPARK-4498.
  If the scheduler stopped trying to schedule on nodes that fail a lot, we 
 could still make progress. These blacklist events are really important and I 
 think would need to be well logged and surfaced in the UI, but I'd rather log 
 and carry on than fail hard. I think the tradeoff here is that you risk 
 blacklisting every worker as well if there is something systematically wrong 
 with communication or whatever else I can't imagine.
 Please let me know if I've misunderstood how the scheduler works or you need 
 more information or anything like that and I'll be happy to provide. 
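
 To make the proposal concrete, a hedged sketch of the suggested policy (the 
 class, the threshold, and the bookkeeping are illustrative only, not how the 
 standalone Master is actually implemented):

{code}
// Illustrative sketch of a worker-level blacklist: count executor-launch failures per
// worker and stop offering resources from workers that keep failing systematically.
import scala.collection.mutable

class WorkerBlacklist(maxFailuresPerWorker: Int = 3) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Called whenever an executor launched on this worker dies immediately after start.
  def recordExecutorFailure(workerId: String): Unit = {
    failures(workerId) += 1
  }

  // The scheduler would skip blacklisted workers when placing new executors,
  // and log/surface the event in the UI as suggested above.
  def isBlacklisted(workerId: String): Boolean =
    failures(workerId) >= maxFailuresPerWorker
}
{code}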




[jira] [Created] (SPARK-4861) Refactory command in spark sql

2014-12-16 Thread wangfei (JIRA)
wangfei created SPARK-4861:
--

 Summary: Refactory command in spark sql
 Key: SPARK-4861
 URL: https://issues.apache.org/jira/browse/SPARK-4861
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1
Reporter: wangfei
 Fix For: 1.3.0


Fix a todo in spark sql:  remove ```Command``` and use ```RunnableCommand``` 
instead.







[jira] [Commented] (SPARK-4861) Refactory command in spark sql

2014-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248380#comment-14248380
 ] 

Apache Spark commented on SPARK-4861:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3712

 Refactory command in spark sql
 --

 Key: SPARK-4861
 URL: https://issues.apache.org/jira/browse/SPARK-4861
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.1
Reporter: wangfei
 Fix For: 1.3.0


 Fix a todo in spark sql:  remove ```Command``` and use ```RunnableCommand``` 
 instead.






[jira] [Created] (SPARK-4862) Streaming | Setting checkpoint as a local directory results in Checkpoint RDD has different partitions error

2014-12-16 Thread Aniket Bhatnagar (JIRA)
Aniket Bhatnagar created SPARK-4862:
---

 Summary: Streaming | Setting checkpoint as a local directory 
results in Checkpoint RDD has different partitions error
 Key: SPARK-4862
 URL: https://issues.apache.org/jira/browse/SPARK-4862
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.1.1, 1.1.0
Reporter: Aniket Bhatnagar
Priority: Minor


If the checkpoint is set as a local filesystem directory, it results in weird 
error messages like the following:

org.apache.spark.SparkException: Checkpoint RDD CheckpointRDD[467] at apply at 
List.scala:318(0) has different number of partitions than original RDD 
MapPartitionsRDD[461] at mapPartitions at StateDStream.scala:71(56)

It would be great if Spark could output a better error message that hints at 
what could have gone wrong.
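
For context, a minimal sketch of the setup that triggers this (the app name and 
paths are placeholders; the point is that a plain local path is passed to 
checkpoint()):

{code}
// Hedged sketch of the reported setup.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-repro")
val ssc = new StreamingContext(conf, Seconds(10))

// A local directory that is not shared across the cluster can lead to the
// "has different number of partitions than original RDD" error above on recovery.
ssc.checkpoint("/tmp/streaming-checkpoints")

// A distributed filesystem URI is what the checkpointing code expects, e.g.:
// ssc.checkpoint("hdfs:///user/someone/streaming-checkpoints")
{code}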






[jira] [Created] (SPARK-4863) Suspicious exception handlers

2014-12-16 Thread Ding Yuan (JIRA)
Ding Yuan created SPARK-4863:


 Summary: Suspicious exception handlers
 Key: SPARK-4863
 URL: https://issues.apache.org/jira/browse/SPARK-4863
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Ding Yuan


Following up with the discussion in 
https://issues.apache.org/jira/browse/SPARK-1148, I am creating a new JIRA to 
report the suspicious exception handlers detected by our tool aspirator on 
spark-1.1.1. 



==
WARNING: TODO;  in handler.
  Line: 129, File: org/apache/thrift/transport/TNonblockingServerSocket.java

122:  public void registerSelector(Selector selector) {
123:try {
124:  // Register the server socket channel, indicating an interest in
125:  // accepting new connections
126:  serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
127:} catch (ClosedChannelException e) {
128:  // this shouldn't happen, ideally...
129:  // TODO: decide what to do with this.
130:}
131:  }

==

==
WARNING: TODO;  in handler.
  Line: 1583, File: org/apache/spark/SparkContext.scala

1578: val scheduler = try {
1579:   val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
1580:   val cons = clazz.getConstructor(classOf[SparkContext])
1581:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
1582: } catch {
1583:   // TODO: Enumerate the exact reasons why it can fail
1584:   // But irrespective of it, it means we cannot proceed !
1585:   case e: Exception => {
1586: throw new SparkException("YARN mode not available ?", e)
1587:   }

==

==
WARNING 1: empty handler for exception: java.lang.Exception
THERE IS NO LOG MESSAGE!!!
  Line: 75, File: org/apache/spark/repl/ExecutorClassLoader.scala

try {
  val pathInDirectory = name.replace('.', '/') + ".class"
  val inputStream = {
    if (fileSystem != null) {
      fileSystem.open(new Path(directory, pathInDirectory))
    } else {
      if (SparkEnv.get.securityManager.isAuthenticationEnabled()) {
        val uri = new URI(classUri + "/" + urlEncode(pathInDirectory))
        val newuri = Utils.constructURIForAuthentication(uri, SparkEnv.get.securityManager)
        newuri.toURL().openStream()
      } else {
        new URL(classUri + "/" + urlEncode(pathInDirectory)).openStream()
      }
    }
  }
  val bytes = readAndTransformClass(name, inputStream)
  inputStream.close()
  Some(defineClass(name, bytes, 0, bytes.length))
} catch {
  case e: Exception => None
}

==

==
WARNING 1: empty handler for exception: java.io.IOException
THERE IS NO LOG MESSAGE!!!
  Line: 275, File: org/apache/spark/util/Utils.scala

  try {
    dir = new File(root, "spark-" + UUID.randomUUID.toString)
    if (dir.exists() || !dir.mkdirs()) {
      dir = null
    }
  } catch { case e: IOException => ; }

==

==
WARNING 1: empty handler for exception: java.lang.InterruptedException
THERE IS NO LOG MESSAGE!!!
  Line: 172, File: parquet/org/apache/thrift/server/TNonblockingServer.java

  protected void joinSelector() {
    // wait until the selector thread exits
    try {
      selectThread_.join();
    } catch (InterruptedException e) {
      // for now, just silently ignore. technically this means we'll have less of
      // a graceful shutdown as a result.
    }
  }

==

==
WARNING 2: empty handler for exception: java.net.SocketException
There are log messages..
  Line: 111, File: parquet/org/apache/thrift/transport/TNonblockingSocket.java

  public void setTimeout(int timeout) {
    try {
      socketChannel_.socket().setSoTimeout(timeout);
    } catch (SocketException sx) {
      LOGGER.warn("Could not set socket timeout.", sx);
    }
  }

==
==
WARNING 3: empty handler for exception: java.net.SocketException
There are log messages..
  Line: 103, File: parquet/org/apache/thrift/transport/TServerSocket.java

  public void listen() throws TTransportException {
    // Make sure not to block on accept
    if (serverSocket_ != null) {
      try {
        serverSocket_.setSoTimeout(0);
      } catch (SocketException sx) {
        LOGGER.error("Could not set socket timeout.", sx);
      }
    }
  }

==



[jira] [Updated] (SPARK-4863) Suspicious exception handlers

2014-12-16 Thread Ding Yuan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ding Yuan updated SPARK-4863:
-
Description: 
Following up with the discussion in 
https://issues.apache.org/jira/browse/SPARK-1148, I am creating a new JIRA to 
report the suspicious exception handlers detected by our tool aspirator on 
spark-1.1.1. 


{noformat}
==
WARNING: TODO;  in handler.
  Line: 129, File: org/apache/thrift/transport/TNonblockingServerSocket.java

122:  public void registerSelector(Selector selector) {
123:try {
124:  // Register the server socket channel, indicating an interest in
125:  // accepting new connections
126:  serverSocketChannel.register(selector, SelectionKey.OP_ACCEPT);
127:} catch (ClosedChannelException e) {
128:  // this shouldn't happen, ideally...
129:  // TODO: decide what to do with this.
130:}
131:  }

==

==
WARNING: TODO;  in handler.
  Line: 1583, File: org/apache/spark/SparkContext.scala

1578: val scheduler = try {
1579:   val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
1580:   val cons = clazz.getConstructor(classOf[SparkContext])
1581:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
1582: } catch {
1583:   // TODO: Enumerate the exact reasons why it can fail
1584:   // But irrespective of it, it means we cannot proceed !
1585:   case e: Exception => {
1586: throw new SparkException("YARN mode not available ?", e)
1587:   }

==

==
WARNING 1: empty handler for exception: java.lang.Exception
THERE IS NO LOG MESSAGE!!!
  Line: 75, File: org/apache/spark/repl/ExecutorClassLoader.scala

try {
  val pathInDirectory = name.replace('.', '/') + ".class"
  val inputStream = {
    if (fileSystem != null) {
      fileSystem.open(new Path(directory, pathInDirectory))
    } else {
      if (SparkEnv.get.securityManager.isAuthenticationEnabled()) {
        val uri = new URI(classUri + "/" + urlEncode(pathInDirectory))
        val newuri = Utils.constructURIForAuthentication(uri, SparkEnv.get.securityManager)
        newuri.toURL().openStream()
      } else {
        new URL(classUri + "/" + urlEncode(pathInDirectory)).openStream()
      }
    }
  }
  val bytes = readAndTransformClass(name, inputStream)
  inputStream.close()
  Some(defineClass(name, bytes, 0, bytes.length))
} catch {
  case e: Exception => None
}

==

==
WARNING 1: empty handler for exception: java.io.IOException
THERE IS NO LOG MESSAGE!!!
  Line: 275, File: org/apache/spark/util/Utils.scala

  try {
    dir = new File(root, "spark-" + UUID.randomUUID.toString)
    if (dir.exists() || !dir.mkdirs()) {
      dir = null
    }
  } catch { case e: IOException => ; }

==

==
WARNING 1: empty handler for exception: java.lang.InterruptedException
THERE IS NO LOG MESSAGE!!!
  Line: 172, File: parquet/org/apache/thrift/server/TNonblockingServer.java

  protected void joinSelector() {
    // wait until the selector thread exits
    try {
      selectThread_.join();
    } catch (InterruptedException e) {
      // for now, just silently ignore. technically this means we'll have less of
      // a graceful shutdown as a result.
    }
  }

==

==
WARNING 2: empty handler for exception: java.net.SocketException
There are log messages..
  Line: 111, File: parquet/org/apache/thrift/transport/TNonblockingSocket.java

  public void setTimeout(int timeout) {
    try {
      socketChannel_.socket().setSoTimeout(timeout);
    } catch (SocketException sx) {
      LOGGER.warn("Could not set socket timeout.", sx);
    }
  }

==
==
WARNING 3: empty handler for exception: java.net.SocketException
There are log messages..
  Line: 103, File: parquet/org/apache/thrift/transport/TServerSocket.java

  public void listen() throws TTransportException {
    // Make sure not to block on accept
    if (serverSocket_ != null) {
      try {
        serverSocket_.setSoTimeout(0);
      } catch (SocketException sx) {
        LOGGER.error("Could not set socket timeout.", sx);
      }
    }
  }

==


==
WARNING 4: empty handler for exception: java.net.SocketException
There are log messages..
  Line: 70, File: 

[jira] [Commented] (SPARK-1148) Suggestions for exception handling (avoid potential bugs)

2014-12-16 Thread Ding Yuan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248707#comment-14248707
 ] 

Ding Yuan commented on SPARK-1148:
--

Glad to know these cases are addressed! As for version 0.9.0, I have reported 
all the cases (only three of them). I have opened another JIRA: 
https://issues.apache.org/jira/browse/SPARK-4863 to report the warnings on 
spark-1.1.1. 

 Suggestions for exception handling (avoid potential bugs)
 -

 Key: SPARK-1148
 URL: https://issues.apache.org/jira/browse/SPARK-1148
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ding Yuan

 Hi Spark developers,
 We are a group of researchers on software reliability. Recently we did a 
 study and found that the majority of the most severe failures in data-analytic 
 systems are caused by bugs in exception handlers, since it is hard to 
 anticipate all the possible real-world error scenarios. Therefore we built a 
 simple checking tool that automatically detects some bug patterns that have 
 caused some very severe real-world failures. I am reporting a few cases here. 
 Any feedback is much appreciated!
 Ding
 =
 Case 1:
 Line: 1249, File: org/apache/spark/SparkContext.scala
 {noformat}
 1244: val scheduler = try {
 1245:   val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
 1246:   val cons = clazz.getConstructor(classOf[SparkContext])
 1247:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
 1248: } catch {
 1249:   // TODO: Enumerate the exact reasons why it can fail
 1250:   // But irrespective of it, it means we cannot proceed !
 1251:   case th: Throwable => {
 1252: throw new SparkException("YARN mode not available ?", th)
 1253:   }
 {noformat}
 The comment suggests the specific exceptions should be enumerated here.
 The try block could throw the following exceptions:
 ClassNotFoundException
 NegativeArraySizeException
 NoSuchMethodException
 SecurityException
 InstantiationException
 IllegalAccessException
 IllegalArgumentException
 InvocationTargetException
 ClassCastException
 ==
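
 For illustration, a hedged sketch of what enumerating the exceptions in Case 1 
 might look like, reusing the scope of the snippet above (sc, SparkContext, 
 TaskSchedulerImpl, SparkException); this is a sketch, not a proposed patch:

{code}
// Hedged sketch for Case 1: enumerate the reflective failures instead of catching Throwable.
val scheduler =
  try {
    val clazz = Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
    val cons = clazz.getConstructor(classOf[SparkContext])
    cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
  } catch {
    case e @ (_: ClassNotFoundException | _: NoSuchMethodException | _: SecurityException |
              _: InstantiationException | _: IllegalAccessException |
              _: java.lang.reflect.InvocationTargetException | _: ClassCastException) =>
      throw new SparkException("YARN mode not available ?", e)
  }
{code}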
 =
 Case 2:
 Line: 282, File: org/apache/spark/executor/Executor.scala
 {noformat}
 265: case t: Throwable => {
 266:   val serviceTime = (System.currentTimeMillis() - taskStart).toInt
 267:   val metrics = attemptedTask.flatMap(t => t.metrics)
 268:   for (m <- metrics) {
 269: m.executorRunTime = serviceTime
 270: m.jvmGCTime = gcTime - startGCTime
 271:   }
 272:   val reason = ExceptionFailure(t.getClass.getName, t.toString, t.getStackTrace, metrics)
 273:   execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
 274:
 275:   // TODO: Should we exit the whole executor here? On the one hand, the failed task may
 276:   // have left some weird state around depending on when the exception was thrown, but on
 277:   // the other hand, maybe we could detect that when future tasks fail and exit then.
 278:   logError("Exception in task ID " + taskId, t)
 279:   //System.exit(1)
 280: }
 281:   } finally {
 282: // TODO: Unregister shuffle memory only for ResultTask
 283: val shuffleMemoryMap = env.shuffleMemoryMap
 284: shuffleMemoryMap.synchronized {
 285:   shuffleMemoryMap.remove(Thread.currentThread().getId)
 286: }
 287: runningTasks.remove(taskId)
 288:   }
 {noformat}
 The comment in this Throwable exception handler seems to suggest that the 
 system should just exit?
 ==
 =
 Case 3:
 Line: 70, File: org/apache/spark/network/netty/FileServerHandler.java
 {noformat}
 66:   try {
 67: ctx.write(new DefaultFileRegion(new FileInputStream(file)
 68:   .getChannel(), fileSegment.offset(), fileSegment.length()));
 69:   } catch (Exception e) {
 70:   LOG.error("Exception: ", e);
 71:   }
 {noformat}
 Exception is too general; the try block only throws FileNotFoundException.
 Although there is nothing wrong with it now, if the code evolves later this
 might cause other exceptions to be swallowed.
 ==






[jira] [Resolved] (SPARK-4437) Docs for difference between WholeTextFileRecordReader and WholeCombineFileRecordReader

2014-12-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4437.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3301
[https://github.com/apache/spark/pull/3301]

 Docs for difference between WholeTextFileRecordReader and 
 WholeCombineFileRecordReader
 --

 Key: SPARK-4437
 URL: https://issues.apache.org/jira/browse/SPARK-4437
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Andrew Ash
Assignee: Davies Liu
 Fix For: 1.3.0


 Tracking per this dev@ thread:
 {quote}
 On Sun, Nov 16, 2014 at 4:49 PM, Reynold Xin r...@databricks.com wrote:
 I don't think the code is immediately obvious.
 Davies - I think you added the code, and Josh reviewed it. Can you guys
 explain and maybe submit a patch to add more documentation on the whole
 thing?
 Thanks.
 On Sun, Nov 16, 2014 at 3:22 AM, Vibhanshu Prasad vibhanshugs...@gmail.com
 wrote:
  Hello Everyone,
 
  I am going through the source code of rdd and Record readers
  There are found 2 classes
 
  1. WholeTextFileRecordReader
  2. WholeCombineFileRecordReader  ( extends CombineFileRecordReader )
 
  The description of both the classes is perfectly similar.
 
  I am not able to understand why we have 2 classes. Is
  CombineFileRecordReader providing some extra advantage?
 
  Regards
  Vibhanshu
 {quote}






[jira] [Resolved] (SPARK-4855) Python tests for hypothesis testing

2014-12-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-4855.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3679
[https://github.com/apache/spark/pull/3679]

 Python tests for hypothesis testing
 ---

 Key: SPARK-4855
 URL: https://issues.apache.org/jira/browse/SPARK-4855
 Project: Spark
  Issue Type: Test
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Assignee: Ben Cook
Priority: Minor
 Fix For: 1.3.0


 Add Python unit tests for Chi-Squared hypothesis testing






[jira] [Closed] (SPARK-4839) Adding documentations about dynamic resource allocation

2014-12-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4839.

Resolution: Duplicate

Closing this as a duplicate

 Adding documentations about dynamic resource allocation
 ---

 Key: SPARK-4839
 URL: https://issues.apache.org/jira/browse/SPARK-4839
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Tsuyoshi OZAWA
 Fix For: 1.2.0


 There are no docs about dynamicAllocation. We should add them.






[jira] [Commented] (SPARK-4846) When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: Requested array size exceeds VM limit

2014-12-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248802#comment-14248802
 ] 

Joseph K. Bradley commented on SPARK-4846:
--

Changing vectorSize sounds too aggressive to me.  I'd vote for either the 
simple solution (throw a nice error), or an efficient method which chooses 
minCount automatically.  For the latter, this might work:

1. Try the current method.  Catch errors during the collect() for vocab and 
during the array allocations.  If there is no error, skip step 2.
2. If there is an error, do 1 pass over the data to collect stats (e.g., a 
histogram).  Use those stats to choose a reasonable minCount.  Choose the vocab 
again, etc.
3. After the big array allocations, the algorithm can continue as before.
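
A rough sketch of step 2 above, assuming an RDD[String] of corpus words (the 
helper name, target size, and heuristic are illustrative, not MLlib code):

{code}
// Hedged sketch: derive a minCount that keeps the vocabulary below a target size.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

def chooseMinCount(words: RDD[String], targetVocabSize: Int): Int = {
  // Histogram: for each frequency, how many distinct words occur exactly that often.
  val wordCounts = words.map(w => (w, 1L)).reduceByKey(_ + _)
  val freqHistogram = wordCounts.map { case (_, c) => (c, 1L) }
    .reduceByKey(_ + _)
    .collect()
    .sortBy { case (freq, _) => -freq } // most frequent buckets first

  var kept = 0L
  var minCount = 1L
  // Add buckets from the most to the least frequent words until the target size is reached.
  for ((freq, numWords) <- freqHistogram if kept < targetVocabSize) {
    kept += numWords
    minCount = freq
  }
  minCount.toInt
}
{code}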


 When the vocabulary size is large, Word2Vec may yield OutOfMemoryError: 
 Requested array size exceeds VM limit
 ---

 Key: SPARK-4846
 URL: https://issues.apache.org/jira/browse/SPARK-4846
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
 Environment: Use Word2Vec to process a corpus(sized 3.5G) with one 
 partition.
 The corpus contains about 300 million words and its vocabulary size is about 
 10 million.
Reporter: Joseph Tang
Priority: Critical

 Exception in thread "Driver" java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit 
 at java.util.Arrays.copyOf(Arrays.java:2271)
 at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
 at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
 at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1870)
 at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1779)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1186)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
 at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
 at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
 at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
 at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
 at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:610)
 at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:291)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:290)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4864) Add documentation to Netty-based configs

2014-12-16 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4864:
-

 Summary: Add documentation to Netty-based configs
 Key: SPARK-4864
 URL: https://issues.apache.org/jira/browse/SPARK-4864
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Currently there is no public documentation for the NettyBlockTransferService or 
various configuration options of the network package. We should add some.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4864) Add documentation to Netty-based configs

2014-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248820#comment-14248820
 ] 

Apache Spark commented on SPARK-4864:
-

User 'aarondav' has created a pull request for this issue:
https://github.com/apache/spark/pull/3713

 Add documentation to Netty-based configs
 

 Key: SPARK-4864
 URL: https://issues.apache.org/jira/browse/SPARK-4864
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Aaron Davidson
Assignee: Aaron Davidson

 Currently there is no public documentation for the NettyBlockTransferService 
 or various configuration options of the network package. We should add some.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3405) EC2 cluster creation on VPC

2014-12-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3405.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 2872
[https://github.com/apache/spark/pull/2872]

 EC2 cluster creation on VPC
 ---

 Key: SPARK-3405
 URL: https://issues.apache.org/jira/browse/SPARK-3405
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.0.2
 Environment: Ubuntu 12.04
Reporter: Dawson Reid
Priority: Minor
 Fix For: 1.3.0


 It would be very useful to be able to specify the EC2 VPC in which the Spark 
 cluster should be created. 
 When creating a Spark cluster on AWS via the spark-ec2 script, there is no way 
 to specify the ID of the VPC in which you would like the cluster to be created; 
 the script always creates the cluster in the default VPC. 
 In my case I have deleted the default VPC, and the spark-ec2 script errors out 
 with the following: 
 Setting up security groups...
 Creating security group test-master
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default 
 VPC for this 
 user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>
 Traceback (most recent call last):
   File "./spark_ec2.py", line 860, in <module>
     main()
   File "./spark_ec2.py", line 852, in main
     real_main()
   File "./spark_ec2.py", line 735, in real_main
     conn, opts, cluster_name)
   File "./spark_ec2.py", line 247, in launch_cluster
     master_group = get_or_make_group(conn, cluster_name + "-master")
   File "./spark_ec2.py", line 143, in get_or_make_group
     return conn.create_security_group(name, "Spark EC2 group")
   File 
 "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py",
   line 2011, in create_security_group
   File 
 "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py",
   line 925, in get_object
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default 
 VPC for this 
 user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2611) VPC Issue while creating an ec2 cluster

2014-12-16 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248836#comment-14248836
 ] 

Josh Rosen commented on SPARK-2611:
---

For the folks watching this issue: SPARK-3405 has now been merged into 
{{master}}!

 VPC Issue while creating an ec2 cluster
 ---

 Key: SPARK-2611
 URL: https://issues.apache.org/jira/browse/SPARK-2611
 Project: Spark
  Issue Type: Bug
  Components: EC2, PySpark
Affects Versions: 1.0.0
 Environment: Debian wheezy
Reporter: Anass BENSRHIR

 I'm having a critical issue while creating an EC2 cluster. Here is the input 
 command and the output I got:
 ./spark-ec2 -i ~/amazonhdp.pem -k amazonhdp -s 4 -t m1.small launch hadoopi
 Setting up security groups...
 Creating security group hadoopi-master
 Creating security group hadoopi-slaves
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The 
 security group 'sg-2ec1b84b' does not 
 exist</Message></Error></Errors><RequestID>6554c7a8-f68a-4032-ad63-65106e2de9b3</RequestID></Response>
 Traceback (most recent call last):
   File "./spark_ec2.py", line 856, in <module>
     main()
   File "./spark_ec2.py", line 848, in main
     real_main()
   File "./spark_ec2.py", line 731, in real_main
     conn, opts, cluster_name)
   File "./spark_ec2.py", line 252, in launch_cluster
     master_group.authorize('tcp', 50030, 50030, '0.0.0.0/0')
   File 
 "/opt/spark-1.0.1-bin-hadoop2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/securitygroup.py",
   line 184, in authorize
   File 
 "/opt/spark-1.0.1-bin-hadoop2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py",
   line 2181, in authorize_security_group
   File 
 "/opt/spark-1.0.1-bin-hadoop2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py",
   line 944, in get_status
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidGroup.NotFound</Code><Message>The 
 security group 'sg-2ec1b84b' does not 
 exist</Message></Error></Errors><RequestID>6554c7a8-f68a-4032-ad63-65106e2de9b3</RequestID></Response>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4865) rdds exposed to sql context via registerTempTable are not listed via thrift jdbc show tables

2014-12-16 Thread Misha Chernetsov (JIRA)
Misha Chernetsov created SPARK-4865:
---

 Summary: rdds exposed to sql context via registerTempTable are not 
listed via thrift jdbc show tables
 Key: SPARK-4865
 URL: https://issues.apache.org/jira/browse/SPARK-4865
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Misha Chernetsov
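The description was left empty; as context for the title, a minimal 
reproduction sketch might look like the following (all names and setup here are 
assumptions for illustration, not taken from the report):
{code}
// Register an RDD as a temp table; the expectation is that a JDBC client of a
// Thrift server sharing this context would then see it via SHOW TABLES.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Person(name: String, age: Int)

object TempTableRepro extends App {
  val sc = new SparkContext(new SparkConf().setAppName("TempTableRepro"))
  val hiveContext = new HiveContext(sc)
  import hiveContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD

  val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
  people.registerTempTable("people")

  // Reported behavior: "SHOW TABLES" issued over the Thrift JDBC interface
  // does not list "people".
}
{code}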






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4700) Add Http support to Spark Thrift server

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4700.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3672
[https://github.com/apache/spark/pull/3672]

 Add Http support to Spark Thrift server
 ---

 Key: SPARK-4700
 URL: https://issues.apache.org/jira/browse/SPARK-4700
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.3.0, 1.2.1
 Environment: Linux and Windows
Reporter: Judy Nash
 Fix For: 1.3.0

   Original Estimate: 48h
  Remaining Estimate: 48h

 Currently Thrift only supports TCP connections. 
 This JIRA is to add HTTP support to the Spark Thrift server in addition to the 
 TCP protocol. Both TCP and HTTP are supported by Hive today. HTTP is more 
 secure and is often used on Windows. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1148) Suggestions for exception handling (avoid potential bugs)

2014-12-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14248941#comment-14248941
 ] 

Sean Owen commented on SPARK-1148:
--

[~d.yuan] I don't know that these have been addressed; some issues like it have 
been, if I recall correctly. Whatever the status, I think it's useful to focus 
on master and/or later branches. Would it be OK to close this in favor of 
SPARK-4863?

 Suggestions for exception handling (avoid potential bugs)
 -

 Key: SPARK-1148
 URL: https://issues.apache.org/jira/browse/SPARK-1148
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ding Yuan

 Hi Spark developers,
 We are a group of researchers on software reliability. Recently we did a 
 study and found that the majority of the most severe failures in data-analytic 
 systems are caused by bugs in exception handlers – it is hard to 
 anticipate all the possible real-world error scenarios. Therefore we built a 
 simple checking tool that automatically detects some bug patterns that have 
 caused some very severe real-world failures. I am reporting a few cases here. 
 Any feedback is much appreciated!
 Ding
 =
 Case 1:
 Line: 1249, File: org/apache/spark/SparkContext.scala
 {noformat}
 1244: val scheduler = try {
 1245:   val clazz = 
 Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
 1246:   val cons = clazz.getConstructor(classOf[SparkContext])
 1247:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
 1248: } catch {
 1249:   // TODO: Enumerate the exact reasons why it can fail
 1250:   // But irrespective of it, it means we cannot proceed !
 1251:   case th: Throwable => {
 1252: throw new SparkException("YARN mode not available ?", th)
 1253:   }
 {noformat}
 The comment suggests the specific exceptions should be enumerated here.
 The try block could throw the following exceptions:
 ClassNotFoundException
 NegativeArraySizeException
 NoSuchMethodException
 SecurityException
 InstantiationException
 IllegalAccessException
 IllegalArgumentException
 InvocationTargetException
 ClassCastException
 ==
 =
 Case 2:
 Line: 282, File: org/apache/spark/executor/Executor.scala
 {noformat}
 265: case t: Throwable => {
 266:   val serviceTime = (System.currentTimeMillis() - 
 taskStart).toInt
 267:   val metrics = attemptedTask.flatMap(t => t.metrics)
 268:   for (m <- metrics) {
 269: m.executorRunTime = serviceTime
 270: m.jvmGCTime = gcTime - startGCTime
 271:   }
 272:   val reason = ExceptionFailure(t.getClass.getName, t.toString, 
 t.getStackTrace, metrics)
 273:   execBackend.statusUpdate(taskId, TaskState.FAILED, 
 ser.serialize(reason))
 274:
 275:   // TODO: Should we exit the whole executor here? On the one 
 hand, the failed task may
 276:   // have left some weird state around depending on when the 
 exception was thrown, but on
 277:   // the other hand, maybe we could detect that when future 
 tasks fail and exit then.
 278:   logError("Exception in task ID " + taskId, t)
 279:   //System.exit(1)
 280: }
 281:   } finally {
 282: // TODO: Unregister shuffle memory only for ResultTask
 283: val shuffleMemoryMap = env.shuffleMemoryMap
 284: shuffleMemoryMap.synchronized {
 285:   shuffleMemoryMap.remove(Thread.currentThread().getId)
 286: }
 287: runningTasks.remove(taskId)
 288:   }
 {noformat}
 From the comment in this Throwable exception handler it seems to suggest that 
 the system should just exit?
 ==
 =
 Case 3:
 Line: 70, File: org/apache/spark/network/netty/FileServerHandler.java
 {noformat}
 66:   try {
 67: ctx.write(new DefaultFileRegion(new FileInputStream(file)
 68:   .getChannel(), fileSegment.offset(), fileSegment.length()));
 69:   } catch (Exception e) {
 70:   LOG.error("Exception: ", e);
 71:   }
 {noformat}
 Exception is too general. The try block only throws FileNotFoundException.
 Although there is nothing wrong with it now, if the code evolves later this
 might cause some other exceptions to be swallowed.
 ==
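 For illustration, a minimal sketch of the Case 1 suggestion (enumerating the 
 reflective failures instead of catching Throwable) might look like this, in 
 plain Scala with assumed names; this is not the actual SparkContext code:
 {code}
 import java.lang.reflect.InvocationTargetException
 
 // Catch only the failures reflection can actually raise, and wrap them.
 def instantiateYarnScheduler(className: String): AnyRef =
   try {
     Class.forName(className).getConstructor().newInstance().asInstanceOf[AnyRef]
   } catch {
     case e @ (_: ClassNotFoundException | _: NoSuchMethodException |
               _: SecurityException | _: InstantiationException |
               _: IllegalAccessException | _: InvocationTargetException |
               _: ClassCastException) =>
       throw new RuntimeException("YARN mode not available?", e)
   }
 {code}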



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4866) Support StructType as key in MapType

2014-12-16 Thread Davies Liu (JIRA)
Davies Liu created SPARK-4866:
-

 Summary: Support StructType as key in MapType
 Key: SPARK-4866
 URL: https://issues.apache.org/jira/browse/SPARK-4866
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Davies Liu


http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Applying-schema-to-a-dictionary-with-a-Tuple-as-key-td20716.html

Hi Guys, 

I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2. 

I am trying to convert an RDD with the tuple 

(u'string', int , {(int, int): int, (int, int): int}) 

to a schema rdd using the schema: 

{code}
fields = [StructField('field1',StringType(),True), 
StructField('field2',IntegerType(),True), 

StructField('field3',MapType(StructType([StructField('field31',IntegerType(),True),
 

StructField('field32',IntegerType(),True)]),IntegerType(),True),True) 
] 

schema = StructType(fields) 
# generate the schemaRDD with the defined schema 
schemaRDD = sqc.applySchema(RDD, schema) 
{code}

But when I add field3 to the schema, it throws an exception: 

{code}
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 1153, in take 
res = self.context.runJob(self, takeUpToNumLeft, p, True) 
  File "/root/spark/python/pyspark/context.py", line 770, in runJob 
it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
javaPartitions, allowLocal) 
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 
538, in __call__ 
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
300, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 
(TID 710, ip-172-31-29-120.ec2.internal): net.razorvine.pickle.PickleException: 
couldn't introspect javabean: java.lang.IllegalArgumentException: wrong number 
of arguments 
net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) 
net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) 
net.razorvine.pickle.Pickler.save(Pickler.java:125) 
net.razorvine.pickle.Pickler.put_map(Pickler.java:321) 
net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) 
net.razorvine.pickle.Pickler.save(Pickler.java:125) 
net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) 
net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) 
net.razorvine.pickle.Pickler.save(Pickler.java:125) 
net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) 
net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) 
net.razorvine.pickle.Pickler.save(Pickler.java:125) 
net.razorvine.pickle.Pickler.dump(Pickler.java:95) 
net.razorvine.pickle.Pickler.dumps(Pickler.java:80) 

org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
 

org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
 
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 

org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:331)
 

org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
 

org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
 

org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) 

org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) 
Driver stacktrace: 
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
 
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
 
at scala.Option.foreach(Option.scala:236) 
at 

[jira] [Resolved] (SPARK-1148) Suggestions for exception handling (avoid potential bugs)

2014-12-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1148.
--
Resolution: Duplicate

 Suggestions for exception handling (avoid potential bugs)
 -

 Key: SPARK-1148
 URL: https://issues.apache.org/jira/browse/SPARK-1148
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Ding Yuan

 Hi Spark developers,
 We are a group of researchers on software reliability. Recently we did a 
 study and found that the majority of the most severe failures in data-analytic 
 systems are caused by bugs in exception handlers – it is hard to 
 anticipate all the possible real-world error scenarios. Therefore we built a 
 simple checking tool that automatically detects some bug patterns that have 
 caused some very severe real-world failures. I am reporting a few cases here. 
 Any feedback is much appreciated!
 Ding
 =
 Case 1:
 Line: 1249, File: org/apache/spark/SparkContext.scala
 {noformat}
 1244: val scheduler = try {
 1245:   val clazz = 
 Class.forName("org.apache.spark.scheduler.cluster.YarnClusterScheduler")
 1246:   val cons = clazz.getConstructor(classOf[SparkContext])
 1247:   cons.newInstance(sc).asInstanceOf[TaskSchedulerImpl]
 1248: } catch {
 1249:   // TODO: Enumerate the exact reasons why it can fail
 1250:   // But irrespective of it, it means we cannot proceed !
 1251:   case th: Throwable => {
 1252: throw new SparkException("YARN mode not available ?", th)
 1253:   }
 {noformat}
 The comment suggests the specific exceptions should be enumerated here.
 The try block could throw the following exceptions:
 ClassNotFoundException
 NegativeArraySizeException
 NoSuchMethodException
 SecurityException
 InstantiationException
 IllegalAccessException
 IllegalArgumentException
 InvocationTargetException
 ClassCastException
 ==
 =
 Case 2:
 Line: 282, File: org/apache/spark/executor/Executor.scala
 {noformat}
 265: case t: Throwable => {
 266:   val serviceTime = (System.currentTimeMillis() - 
 taskStart).toInt
 267:   val metrics = attemptedTask.flatMap(t => t.metrics)
 268:   for (m <- metrics) {
 269: m.executorRunTime = serviceTime
 270: m.jvmGCTime = gcTime - startGCTime
 271:   }
 272:   val reason = ExceptionFailure(t.getClass.getName, t.toString, 
 t.getStackTrace, metrics)
 273:   execBackend.statusUpdate(taskId, TaskState.FAILED, 
 ser.serialize(reason))
 274:
 275:   // TODO: Should we exit the whole executor here? On the one 
 hand, the failed task may
 276:   // have left some weird state around depending on when the 
 exception was thrown, but on
 277:   // the other hand, maybe we could detect that when future 
 tasks fail and exit then.
 278:   logError("Exception in task ID " + taskId, t)
 279:   //System.exit(1)
 280: }
 281:   } finally {
 282: // TODO: Unregister shuffle memory only for ResultTask
 283: val shuffleMemoryMap = env.shuffleMemoryMap
 284: shuffleMemoryMap.synchronized {
 285:   shuffleMemoryMap.remove(Thread.currentThread().getId)
 286: }
 287: runningTasks.remove(taskId)
 288:   }
 {noformat}
 From the comment in this Throwable exception handler it seems to suggest that 
 the system should just exit?
 ==
 =
 Case 3:
 Line: 70, File: org/apache/spark/network/netty/FileServerHandler.java
 {noformat}
 66:   try {
 67: ctx.write(new DefaultFileRegion(new FileInputStream(file)
 68:   .getChannel(), fileSegment.offset(), fileSegment.length()));
 69:   } catch (Exception e) {
 70:   LOG.error("Exception: ", e);
 71:   }
 {noformat}
 Exception is too general. The try block only throws FileNotFoundException.
 Although there is nothing wrong with it now, if the code evolves later this
 might cause some other exceptions to be swallowed.
 ==



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4847) extraStrategies cannot take effect in SQLContext

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4847.
-
   Resolution: Fixed
Fix Version/s: 1.2.1

Issue resolved by pull request 3698
[https://github.com/apache/spark/pull/3698]

 extraStrategies cannot take effect in SQLContext
 

 Key: SPARK-4847
 URL: https://issues.apache.org/jira/browse/SPARK-4847
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Saisai Shao
 Fix For: 1.2.1


 Because strategies is initialized when SparkPlanner is created, 
 extraStrategies added later cannot be added into strategies.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4812) SparkPlan.codegenEnabled may be initialized to a wrong value

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4812.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3660
[https://github.com/apache/spark/pull/3660]

 SparkPlan.codegenEnabled may be initialized to a wrong value
 

 Key: SPARK-4812
 URL: https://issues.apache.org/jira/browse/SPARK-4812
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.3.0


 The problem is that `codegenEnabled` is a `val`, but it uses a `val` 
 `sqlContext`, which can be overridden by subclasses. Here is a simple example 
 to show this issue.
 {code}
 scala> :paste
 // Entering paste mode (ctrl-D to finish)
 abstract class Foo {
   protected val sqlContext = "Foo"
   val codegenEnabled: Boolean = {
 println(sqlContext) // it will call subclass's `sqlContext` which has not 
 yet been initialized.
 if (sqlContext != null) {
   true
 } else {
   false
 }
   }
 }
 class Bar extends Foo {
   override val sqlContext = "Bar"
 }
 println(new Bar().codegenEnabled)
 // Exiting paste mode, now interpreting.
 null
 false
 defined class Foo
 defined class Bar
 scala> 
 {code}
 We should make `sqlContext` `final` to prevent subclasses from overriding it 
 incorrectly.
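 For illustration, here is a minimal, self-contained sketch of two ways to avoid 
 this initialization-order trap (plain Scala; these are not the actual SparkPlan 
 changes, and whether the fix used either of them is not stated here):
 {code}
 object InitOrderSketch extends App {
   // Option 1: make the base-class val final, so subclasses cannot override it
   // and the value is already initialized when codegenEnabled is computed.
   abstract class Foo1 {
     protected final val sqlContext = "Foo"
     val codegenEnabled: Boolean = sqlContext != null   // sees "Foo", so true
   }
   class Bar1 extends Foo1
   println(new Bar1().codegenEnabled)   // true
 
   // Option 2: make the dependent field lazy, so it is evaluated only after
   // the subclass has finished initializing its override.
   abstract class Foo2 {
     protected val sqlContext: String = "Foo"
     lazy val codegenEnabled: Boolean = sqlContext != null
   }
   class Bar2 extends Foo2 {
     override val sqlContext = "Bar"
   }
   println(new Bar2().codegenEnabled)   // true: "Bar" is initialized by now
 }
 {code}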



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4767) Add support for launching in a specified placement group to spark ec2 scripts.

2014-12-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4767.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3623
[https://github.com/apache/spark/pull/3623]

 Add support for launching in a specified placement group to spark ec2 scripts.
 --

 Key: SPARK-4767
 URL: https://issues.apache.org/jira/browse/SPARK-4767
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: holdenk
Priority: Trivial
 Fix For: 1.3.0

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Spark EC2 scripts don't currently allow users to specify a placement 
 group. We should add this.
 EC2 placement groups are described in 
 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html 
 This should not require updating the mesos/spark repo since it's 
 self-contained in spark_ec2.py.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4767) Add support for launching in a specified placement group to spark ec2 scripts.

2014-12-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4767:
--
Assignee: Holden Karau

 Add support for launching in a specified placement group to spark ec2 scripts.
 --

 Key: SPARK-4767
 URL: https://issues.apache.org/jira/browse/SPARK-4767
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: holdenk
Assignee: Holden Karau
Priority: Trivial
 Fix For: 1.3.0

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The Spark EC2 scripts don't currently allow users to specify a placement 
 group. We should add this.
 EC2 placement groups are described in 
 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html 
 This should not require updating the mesos/spark repo since it's 
 self-contained in spark_ec2.py.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4815) ThriftServer use only one SessionState to run sql using hive

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-4815.
---
Resolution: Duplicate

 ThriftServer use only one SessionState to run sql using hive 
 -

 Key: SPARK-4815
 URL: https://issues.apache.org/jira/browse/SPARK-4815
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: guowei

 The Thrift server uses only one SessionState to run SQL using Hive, even 
 though the statements come from different Hive sessions.
 This causes mistakes:
 For example, when one user runs "use database" in one beeline client, the 
 current database in other beeline clients changes too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4527) Add BroadcastNestedLoopJoin operator selection testsuite

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4527.
-
   Resolution: Fixed
Fix Version/s: (was: 1.1.0)
   1.3.0

Issue resolved by pull request 3395
[https://github.com/apache/spark/pull/3395]

 Add BroadcastNestedLoopJoin operator selection testsuite
 

 Key: SPARK-4527
 URL: https://issues.apache.org/jira/browse/SPARK-4527
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.1.0
Reporter: XiaoJing wang
Priority: Minor
 Fix For: 1.3.0

   Original Estimate: 0.05h
  Remaining Estimate: 0.05h

 In `JoinSuite`, add `BroadcastNestedLoopJoin` operator selection tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4837) NettyBlockTransferService does not abide by spark.blockManager.port config option

2014-12-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4837:
---
Target Version/s: 1.3.0, 1.2.1  (was: 1.2.1)

 NettyBlockTransferService does not abide by spark.blockManager.port config 
 option
 -

 Key: SPARK-4837
 URL: https://issues.apache.org/jira/browse/SPARK-4837
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker

 The NettyBlockTransferService always binds to a random port, and does not use 
 the spark.blockManager.port config as specified.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4867) UDF clean up

2014-12-16 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-4867:
---

 Summary: UDF clean up
 Key: SPARK-4867
 URL: https://issues.apache.org/jira/browse/SPARK-4867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker


Right now our support and internal implementation of many functions has a few 
issues.  Specifically:
 - UDFs don't know their input types and thus don't do type coercion.
 - We hard code a bunch of built-in functions into the parser.  This is bad 
because in SQL it creates new reserved words for things that aren't actually 
keywords.  Also it means that for each function we need to add support to both 
SQLContext and HiveContext separately.

For this JIRA I propose we do the following:
 - Change the interfaces for registerFunction and ScalaUdf to include types for 
the input arguments as well as the output type.
 - Add a rule to analysis that does type coercion for UDFs.
 - Add a parse rule for functions to SQLParser.
 - Rewrite all the UDFs that are currently hacked into the various parsers 
using this new functionality.

Depending on how big this refactoring becomes, we could split parts 1 and 2 from 
part 3 above. (A toy sketch of the typed-registration idea appears below.)
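For illustration only, here is a toy, self-contained sketch of what typed UDF 
registration could look like (all names here are illustrative stand-ins, not 
Spark's actual classes or the interface this JIRA will end up with):
{code}
// A UDF that carries its declared input types and output type, so an analyzer
// rule could insert casts (type coercion) before invoking it.
sealed trait SqlType
case object IntType extends SqlType
case object StringType extends SqlType

case class TypedUdf(
    name: String,
    inputTypes: Seq[SqlType],
    outputType: SqlType,
    fn: Seq[Any] => Any)

object UdfRegistry {
  private var udfs = Map.empty[String, TypedUdf]
  def register(udf: TypedUdf): Unit = { udfs += udf.name -> udf }
  def lookup(name: String): Option[TypedUdf] = udfs.get(name)
}

object TypedUdfExample extends App {
  UdfRegistry.register(TypedUdf("plus_one", Seq(IntType), IntType,
    args => args.head.asInstanceOf[Int] + 1))
  println(UdfRegistry.lookup("plus_one").map(_.fn(Seq(41))))   // Some(42)
}
{code}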



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4483) Optimization about reduce memory costs during the HashOuterJoin

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4483.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3375
[https://github.com/apache/spark/pull/3375]

 Optimization about reduce memory costs during the HashOuterJoin
 ---

 Key: SPARK-4483
 URL: https://issues.apache.org/jira/browse/SPARK-4483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Yi Tian
Priority: Minor
 Fix For: 1.3.0


 In {{HashOuterJoin.scala}}, Spark reads data from both sides of the join 
 operation before zipping them together, which wastes memory. We are trying to 
 read data from only one side, put it into a hash map, and then generate the 
 {{JoinedRow}} with data from the other side one row at a time.
 Currently, we can only do this optimization for {{left outer join}} and 
 {{right outer join}}. For {{full outer join}}, we will do something in 
 another issue.
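 For illustration, a plain-Scala sketch of the one-sided idea (this is not 
 Spark's actual HashOuterJoin code): build a hash map from the build side only, 
 then stream the other side and emit joined rows one at a time.
 {code}
 def leftOuterHashJoin[K, L, R](
     streamed: Iterator[(K, L)],
     build: Iterator[(K, R)]): Iterator[(K, (L, Option[R]))] = {
   // Only the build side is materialized in memory.
   val buildSide: Map[K, Seq[R]] =
     build.toSeq.groupBy(_._1).mapValues(_.map(_._2)).toMap
   // Stream the other side and generate the joined rows lazily.
   streamed.flatMap { case (k, l) =>
     buildSide.get(k) match {
       case Some(rs) => rs.iterator.map(r => (k, (l, Some(r): Option[R])))
       case None     => Iterator((k, (l, None: Option[R])))
     }
   }
 }
 
 // leftOuterHashJoin(Iterator(1 -> "a", 2 -> "b"), Iterator(1 -> "x")).toList
 // => List((1,(a,Some(x))), (2,(b,None)))
 {code}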



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4827) Max iterations (100) reached for batch Resolution with deeply nested projects and project *s

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4827.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3674
[https://github.com/apache/spark/pull/3674]

 Max iterations (100) reached for batch Resolution with deeply nested projects 
 and project *s
 

 Key: SPARK-4827
 URL: https://issues.apache.org/jira/browse/SPARK-4827
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4269) Make wait time in BroadcastHashJoin configurable

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4269.
-
   Resolution: Fixed
Fix Version/s: (was: 1.2.0)
   1.3.0

Issue resolved by pull request 3133
[https://github.com/apache/spark/pull/3133]

 Make wait time in BroadcastHashJoin configurable
 

 Key: SPARK-4269
 URL: https://issues.apache.org/jira/browse/SPARK-4269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jacky Li
 Fix For: 1.3.0


 In BroadcastHashJoin, currently it is using a hard-coded value (5 minutes) to 
 wait for the execution and broadcast of the small table. 
 In my opinion, it should be a configurable value, since the broadcast may 
 exceed 5 minutes in some cases, such as a busy or congested network environment.
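 For illustration, a minimal sketch of reading the wait from configuration 
 instead of hard coding it (the property name used here is an assumption for 
 illustration, not necessarily what the fix uses):
 {code}
 import java.util.concurrent.TimeUnit
 import scala.concurrent.{Await, Future}
 import scala.concurrent.ExecutionContext.Implicits.global
 import scala.concurrent.duration.Duration
 
 object BroadcastTimeoutSketch extends App {
   // Fall back to the current 5-minute default when the property is not set.
   val timeoutSeconds = sys.props.getOrElse("spark.sql.broadcastTimeout", "300").toInt
 
   val broadcastFuture = Future {
     // ... build and broadcast the small table here ...
     "broadcast handle"
   }
   val result = Await.result(broadcastFuture, Duration(timeoutSeconds, TimeUnit.SECONDS))
   println(result)
 }
 {code}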



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4777) Some block memory after unrollSafely not count into used memory(memoryStore.entrys or unrollMemory)

2014-12-16 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249330#comment-14249330
 ] 

SuYan commented on SPARK-4777:
--

Sean Owen, hi. I intended to close that patch, but after discussing with the 
SPARK-3000 author, he says the current SPARK-3000 has not been merged into 
Spark yet, so the problem is still in the current code; let's keep my patch 
open. 
If there is no need to keep it open, tell me and I will close it, thanks.



 Some block memory after unrollSafely not count into used 
 memory(memoryStore.entrys or unrollMemory)
 ---

 Key: SPARK-4777
 URL: https://issues.apache.org/jira/browse/SPARK-4777
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: SuYan
Priority: Minor

 Some memory is not counted into the memory used by memoryStore or unrollMemory.
 After Thread A unrolls a block safely, it releases 40MB of unrollMemory (which 
 can then be used by other threads). Thread A then waits to acquire 
 accountingLock to tryToPut blockA (30MB). Before Thread A gets accountingLock, 
 blockA's memory size is not counted into unrollMemory or 
 memoryStore.currentMemory.
  IIUC, freeMemory should subtract that block's memory.
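 For illustration only, a plain-Scala sketch of the accounting gap described 
 above (assumed names and sizes; this is not Spark's MemoryStore code): if the 
 released unroll memory and the new block are not accounted under the same lock, 
 another thread can briefly observe more free memory than really exists.
 {code}
 object AccountingSketch {
   private val accountingLock = new Object
   private var freeMemory   = 100L   // MB, illustrative
   private var unrollMemory = 40L    // MB held by Thread A while unrolling
 
   // Doing the release and the charge under one lock closes the window in
   // which the block is counted neither as unroll memory nor as stored memory.
   def finishUnrollAndPut(blockSizeMb: Long): Unit = accountingLock.synchronized {
     unrollMemory -= 40L
     freeMemory   += 40L
     freeMemory   -= blockSizeMb    // charge blockA (e.g. 30MB) immediately
   }
 }
 {code}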



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3955) Different versions between jackson-mapper-asl and jackson-core-asl

2014-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249369#comment-14249369
 ] 

Apache Spark commented on SPARK-3955:
-

User 'jongyoul' has created a pull request for this issue:
https://github.com/apache/spark/pull/3716

 Different versions between jackson-mapper-asl and jackson-core-asl
 --

 Key: SPARK-3955
 URL: https://issues.apache.org/jira/browse/SPARK-3955
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Core, SQL
Affects Versions: 1.1.0
Reporter: Jongyoul Lee

 The parent pom.xml specifies a version of jackson-mapper-asl, which is used 
 by sql/hive/pom.xml. When mvn assembly runs, however, jackson-mapper-asl does 
 not match jackson-core-asl, because other libraries use several versions of 
 Jackson, so another version of jackson-core-asl is assembled. Simply put, this 
 problem is fixed if pom.xml has specific version information for 
 jackson-core-asl. If it's not set, version 1.9.11 is merged into assembly.jar 
 and we cannot use the Jackson library properly.
 {code}
 [INFO] Including org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8 in the 
 shaded jar.
 [INFO] Including org.codehaus.jackson:jackson-core-asl:jar:1.9.11 in the 
 shaded jar.
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2014-12-16 Thread Victor Tso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249381#comment-14249381
 ] 

Victor Tso commented on SPARK-4105:
---

Similarly, I hit this in 1.1.1:

Job aborted due to stage failure: Task 0 in stage 3919.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 3919.0 (TID 2467, 
preprod-e1-sw22s.paxata.com): java.io.IOException: failed to uncompress the 
chunk: FAILED_TO_UNCOMPRESS(5)

org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:384)

java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)

java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)

java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1171)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:48)
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker

 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at 

[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2014-12-16 Thread Victor Tso (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249381#comment-14249381
 ] 

Victor Tso edited comment on SPARK-4105 at 12/17/14 3:11 AM:
-

Similarly, I hit this in 1.1.1:

{code}
Job aborted due to stage failure: Task 0 in stage 3919.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 3919.0 (TID 2467, ...): 
java.io.IOException: failed to uncompress the chunk: FAILED_TO_UNCOMPRESS(5)

org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:384)

java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)

java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)

java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

org.apache.spark.storage.BlockManager$LazyProxyIterator$1.hasNext(BlockManager.scala:1171)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:144)
org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)

org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:48)
org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
{code}


was (Author: silvermast):
Similarly, I hit this in 1.1.1:

Job aborted due to stage failure: Task 0 in stage 3919.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 3919.0 (TID 2467, 
preprod-e1-sw22s.paxata.com): java.io.IOException: failed to uncompress the 
chunk: FAILED_TO_UNCOMPRESS(5)

org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362)
org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:384)

java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2293)

java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2586)

java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2596)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)

java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)


[jira] [Created] (SPARK-4868) Twitter DStream.map() throws Task not serializable

2014-12-16 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-4868:
---

 Summary: Twitter DStream.map() throws Task not serializable
 Key: SPARK-4868
 URL: https://issues.apache.org/jira/browse/SPARK-4868
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Streaming
Affects Versions: 1.1.1
 Environment: * Spark 1.1.1
* EC2 cluster with 1 slave spun up using {{spark-ec2}}
* twitter4j 3.0.3
* {{spark-shell}} called with {{--jars}} argument to load 
{{spark-streaming-twitter_2.10-1.0.0.jar}} as well as all the twitter4j jars.
Reporter: Nicholas Chammas
Priority: Minor


_(Continuing the discussion [started here on the Spark user 
list|http://apache-spark-user-list.1001560.n3.nabble.com/NotSerializableException-in-Spark-Streaming-td5725.html].)_

The following Spark Streaming code throws a serialization exception I do not 
understand.

{code}
import twitter4j.auth.{Authorization, OAuthAuthorization}
import twitter4j.conf.ConfigurationBuilder 
import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

def getAuth(): Option[Authorization] = {
  System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
  System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
  System.setProperty("twitter4j.oauth.accessToken", accessToken) 
  System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

  Some(new OAuthAuthorization(new ConfigurationBuilder().build()))
} 

def noop(a: Any): Any = {
  a
}

val ssc = new StreamingContext(sc, Seconds(5))
val liveTweetObjects = TwitterUtils.createStream(ssc, getAuth())
val liveTweets = liveTweetObjects.map(_.getText)

liveTweets.map(t => noop(t)).print()  // exception here

ssc.start()
{code}

So before I even start the StreamingContext, I get the following stack trace:

{code}
scala> liveTweets.map(t => noop(t)).print()
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1264)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
at $iwC$$iwC$$iwC.<init>(<console>:32)
at $iwC$$iwC.<init>(<console>:34)
at $iwC.<init>(<console>:36)
at <init>(<console>:38)
at .<init>(<console>:42)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at 

[jira] [Commented] (SPARK-4868) Twitter DStream.map() throws Task not serializable

2014-12-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249387#comment-14249387
 ] 

Nicholas Chammas commented on SPARK-4868:
-

cc [~tdas], [~adav]

 Twitter DStream.map() throws Task not serializable
 

 Key: SPARK-4868
 URL: https://issues.apache.org/jira/browse/SPARK-4868
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Streaming
Affects Versions: 1.1.1
 Environment: * Spark 1.1.1
 * EC2 cluster with 1 slave spun up using {{spark-ec2}}
 * twitter4j 3.0.3
 * {{spark-shell}} called with {{--jars}} argument to load 
 {{spark-streaming-twitter_2.10-1.0.0.jar}} as well as all the twitter4j jars.
Reporter: Nicholas Chammas
Priority: Minor

 _(Continuing the discussion [started here on the Spark user 
 list|http://apache-spark-user-list.1001560.n3.nabble.com/NotSerializableException-in-Spark-Streaming-td5725.html].)_
 The following Spark Streaming code throws a serialization exception I do not 
 understand.
 {code}
 import twitter4j.auth.{Authorization, OAuthAuthorization}
 import twitter4j.conf.ConfigurationBuilder 
 import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}
 import org.apache.spark.streaming.twitter.TwitterUtils
 def getAuth(): Option[Authorization] = {
   System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
   System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
   System.setProperty("twitter4j.oauth.accessToken", accessToken) 
   System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)
   Some(new OAuthAuthorization(new ConfigurationBuilder().build()))
 } 
 def noop(a: Any): Any = {
   a
 }
 val ssc = new StreamingContext(sc, Seconds(5))
 val liveTweetObjects = TwitterUtils.createStream(ssc, getAuth())
 val liveTweets = liveTweetObjects.map(_.getText)
 liveTweets.map(t => noop(t)).print()  // exception here
 ssc.start()
 {code}
 So before I even start the StreamingContext, I get the following stack trace:
 {code}
 scala> liveTweets.map(t => noop(t)).print()
 org.apache.spark.SparkException: Task not serializable
   at 
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1264)
   at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
   at $iwC$$iwC$$iwC.<init>(<console>:32)
   at $iwC$$iwC.<init>(<console>:34)
   at $iwC.<init>(<console>:36)
   at <init>(<console>:38)
   at .<init>(<console>:42)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
   at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625)
   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633)
   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
   at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
   at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911)
   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006)
   at org.apache.spark.repl.Main$.main(Main.scala:31)
   at org.apache.spark.repl.Main.main(Main.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 

[jira] [Created] (SPARK-4869) The variable names in IF statement of SQL doesn't resolve to its value.

2014-12-16 Thread Ajay (JIRA)
Ajay created SPARK-4869:
---

 Summary: The variable names in IF statement of SQL doesn't resolve 
to its value. 
 Key: SPARK-4869
 URL: https://issues.apache.org/jira/browse/SPARK-4869
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Ajay
Priority: Blocker


We got stuck with the "IF-THEN" statement in Spark SQL. Our use case requires nested 
"if" statements, but Spark SQL is not able to resolve column names in the final 
evaluation of IF, while literal values work. Please fix this bug. 

This works:
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) 
as ROLL_BACKWARD FROM OUTER_RDD")

This doesn't:
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")
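
For reference, a minimal sketch (not from the report) that exercises the same pattern. 
The case class and data are made up; sqlSC is assumed to be a Spark 1.1-era SQLContext 
(the original report may equally use a HiveContext), and the IF function is assumed to 
be available through whichever context the report uses:

{code}
// Hedged sketch: reproduce the shape of the two queries above against a tiny
// in-memory table. Table and column names come from the report; the data is made up.
import org.apache.spark.sql.SQLContext

case class Account(UNIT: String, PAST_DUE: String, DAYS_30: Int)

val sqlSC = new SQLContext(sc)  // or a HiveContext, as in the report
import sqlSC.createSchemaRDD

sc.parallelize(Seq(
  Account("A", "CURRENT_MONTH", 5),
  Account("B", "DAYS_30", 12)
)).registerTempTable("OUTER_RDD")

// Literal 1 in the else branch: reported to work.
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) " +
  "as ROLL_BACKWARD FROM OUTER_RDD").collect()

// Column DAYS_30 in the else branch: reported to fail with an unresolved attribute.
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, DAYS_30) " +
  "as ROLL_BACKWARD FROM OUTER_RDD").collect()
{code}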



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.

2014-12-16 Thread Ajay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay updated SPARK-4869:

Summary: The variable names in IF statement of Spark SQL doesn't resolve to 
its value.   (was: The variable names in IF statement of SQL doesn't resolve to 
its value. )

 The variable names in IF statement of Spark SQL doesn't resolve to its value. 
 --

 Key: SPARK-4869
 URL: https://issues.apache.org/jira/browse/SPARK-4869
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Ajay
Priority: Blocker

 We got stuck with the "IF-THEN" statement in Spark SQL. Our use case requires 
 nested "if" statements, but Spark SQL is not able to resolve column names in 
 the final evaluation of IF, while literal values work. Please fix this bug. 
 This works:
 sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
 0, 1) as ROLL_BACKWARD FROM OUTER_RDD")
 This doesn't:
 sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
 0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4869) The variable names in IF statement of Spark SQL doesn't resolve to its value.

2014-12-16 Thread Ajay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay updated SPARK-4869:

Description: 
We got stuck with the "IF-THEN" statement in Spark SQL. Our use case requires nested 
"if" statements, but Spark SQL is not able to resolve column names in the final 
evaluation of IF, while literal values work. An Unresolved Attributes error is 
thrown. Please fix this bug. 

This works:
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) 
as ROLL_BACKWARD FROM OUTER_RDD")

This doesn't:
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")

  was:
We got stuck with the "IF-THEN" statement in Spark SQL. Our use case requires nested 
"if" statements, but Spark SQL is not able to resolve column names in the final 
evaluation of IF, while literal values work. Please fix this bug. 

This works:
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 0, 1) 
as ROLL_BACKWARD FROM OUTER_RDD")

This doesn't:
sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")


 The variable names in IF statement of Spark SQL doesn't resolve to its value. 
 --

 Key: SPARK-4869
 URL: https://issues.apache.org/jira/browse/SPARK-4869
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1
Reporter: Ajay
Priority: Blocker

 We got stuck with the "IF-THEN" statement in Spark SQL. Our use case requires 
 nested "if" statements, but Spark SQL is not able to resolve column names in 
 the final evaluation of IF, while literal values work. An Unresolved Attributes 
 error is thrown. Please fix this bug. 
 This works:
 sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
 0, 1) as ROLL_BACKWARD FROM OUTER_RDD")
 This doesn't:
 sqlSC.sql("SELECT DISTINCT UNIT, PAST_DUE, IF(PAST_DUE = 'CURRENT_MONTH', 
 0, DAYS_30) as ROLL_BACKWARD FROM OUTER_RDD")



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4744) Short Circuit evaluation for AND OR in code gen

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4744.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3606
[https://github.com/apache/spark/pull/3606]

 Short Circuit evaluation for AND  OR in code gen
 -

 Key: SPARK-4744
 URL: https://issues.apache.org/jira/browse/SPARK-4744
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Priority: Minor
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4798) Refactor Parquet test suites

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4798.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3644
[https://github.com/apache/spark/pull/3644]

 Refactor Parquet test suites
 

 Key: SPARK-4798
 URL: https://issues.apache.org/jira/browse/SPARK-4798
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.1.1, 1.2.0
Reporter: Cheng Lian
 Fix For: 1.3.0


 The current {{ParquetQuerySuite}} implementation is too verbose, and it is hard 
 to add new test cases. It would be good to refactor it to enable faster 
 iteration on Parquet support.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4720) Remainder should also return null if the divider is 0.

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4720.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3581
[https://github.com/apache/spark/pull/3581]

 Remainder should also return null if the divider is 0.
 --

 Key: SPARK-4720
 URL: https://issues.apache.org/jira/browse/SPARK-4720
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Takuya Ueshin
 Fix For: 1.3.0


 This is a follow-up of SPARK-4593.
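 As a hedged sketch of the expected semantics (the case class, table name, and data 
 below are made up, and the % operator is assumed to be available in the SQL dialect 
 in use): after the fix, a remainder with a zero divider should evaluate to NULL 
 rather than fail, mirroring the division behavior addressed in SPARK-4593.
 {code}
 // Hedged sketch, not the actual test from the patch.
 import org.apache.spark.sql.SQLContext

 case class Rec(x: Int)
 val sqlContext = new SQLContext(sc)
 import sqlContext.createSchemaRDD

 sc.parallelize(Seq(Rec(7))).registerTempTable("t")
 sqlContext.sql("SELECT x % 0 FROM t").collect()  // expected: Array([null])
 {code}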



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4866) Support StructType as key in MapType

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4866.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3714
[https://github.com/apache/spark/pull/3714]

 Support StructType as key in MapType
 

 Key: SPARK-4866
 URL: https://issues.apache.org/jira/browse/SPARK-4866
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Davies Liu
 Fix For: 1.3.0


 http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Applying-schema-to-a-dictionary-with-a-Tuple-as-key-td20716.html
 Hi guys, 
 I'm running a Spark cluster in AWS with Spark 1.1.0 on EC2. 
 I am trying to convert an RDD of tuples 
 (u'string', int, {(int, int): int, (int, int): int}) 
 to a SchemaRDD using the following schema: 
 {code}
 fields = [StructField('field1', StringType(), True),
           StructField('field2', IntegerType(), True),
           StructField('field3',
                       MapType(StructType([StructField('field31', IntegerType(), True),
                                           StructField('field32', IntegerType(), True)]),
                               IntegerType(), True),
                       True)]
 schema = StructType(fields)
 # generate the schemaRDD with the defined schema
 schemaRDD = sqc.applySchema(RDD, schema)
 {code}
 But when I add field3 to the schema, it throws an exception: 
 {code}
 Traceback (most recent call last): 
   File "<stdin>", line 1, in <module>
   File "/root/spark/python/pyspark/rdd.py", line 1153, in take 
 res = self.context.runJob(self, takeUpToNumLeft, p, True) 
   File "/root/spark/python/pyspark/context.py", line 770, in runJob 
 it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
 javaPartitions, allowLocal) 
   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
 line 538, in __call__ 
   File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
 300, in get_return_value 
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.runJob. 
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
 28.0 (TID 710, ip-172-31-29-120.ec2.internal): 
 net.razorvine.pickle.PickleException: couldn't introspect javabean: 
 java.lang.IllegalArgumentException: wrong number of arguments 
 net.razorvine.pickle.Pickler.put_javabean(Pickler.java:603) 
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:299) 
 net.razorvine.pickle.Pickler.save(Pickler.java:125) 
 net.razorvine.pickle.Pickler.put_map(Pickler.java:321) 
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:286) 
 net.razorvine.pickle.Pickler.save(Pickler.java:125) 
 net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) 
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) 
 net.razorvine.pickle.Pickler.save(Pickler.java:125) 
 net.razorvine.pickle.Pickler.put_arrayOfObjects(Pickler.java:412) 
 net.razorvine.pickle.Pickler.dispatch(Pickler.java:195) 
 net.razorvine.pickle.Pickler.save(Pickler.java:125) 
 net.razorvine.pickle.Pickler.dump(Pickler.java:95) 
 net.razorvine.pickle.Pickler.dumps(Pickler.java:80) 
 
 org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
  
 
 org.apache.spark.sql.SchemaRDD$$anonfun$javaToPython$1$$anonfun$apply$2.apply(SchemaRDD.scala:417)
  
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
 
 org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:331)
  
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
  
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
  
 
 org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
  
 org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) 
 
 org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) 
 Driver stacktrace: 
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
  
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
  
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
  
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
 at 
 

[jira] [Commented] (SPARK-4140) Document the dynamic allocation feature

2014-12-16 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249499#comment-14249499
 ] 

Tsuyoshi OZAWA commented on SPARK-4140:
---

How about converting this issue into one of the subtasks of SPARK-3174? That would 
make it easier to track.

 Document the dynamic allocation feature
 ---

 Key: SPARK-4140
 URL: https://issues.apache.org/jira/browse/SPARK-4140
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or

 This blocks on SPARK-3795 and SPARK-3822.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4618) Make foreign DDL commands options case-insensitive

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4618.
-
Resolution: Fixed

 Make foreign DDL commands options case-insensitive
 --

 Key: SPARK-4618
 URL: https://issues.apache.org/jira/browse/SPARK-4618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


 Make foreign DDL commands options case-insensitive
 so that the following command works:
 ```
   create temporary table normal_parquet
   USING org.apache.spark.sql.parquet
   OPTIONS (
     PATH '/xxx/data'
   )
 ```
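 As a hedged illustration of the intent (not the actual Spark implementation), a data 
 source that normalizes its option keys before looking them up would accept PATH, Path, 
 or path interchangeably:
 {code}
 // Hedged sketch: lower-case the option keys so lookups are case-insensitive.
 val options = Map("PATH" -> "/xxx/data")
 val normalized = options.map { case (k, v) => (k.toLowerCase, v) }
 val path = normalized.getOrElse("path", sys.error("'path' must be specified"))
 {code}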



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4618) Make foreign DDL commands options case-insensitive

2014-12-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4618:

Assignee: wangfei

 Make foreign DDL commands options case-insensitive
 --

 Key: SPARK-4618
 URL: https://issues.apache.org/jira/browse/SPARK-4618
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
Assignee: wangfei
 Fix For: 1.3.0


 Make foreign DDL commands options case-insensitive
 so that the following command works:
 ```
   create temporary table normal_parquet
   USING org.apache.spark.sql.parquet
   OPTIONS (
     PATH '/xxx/data'
   )
 ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4870) Add version information to driver log

2014-12-16 Thread Zhang, Liye (JIRA)
Zhang, Liye created SPARK-4870:
--

 Summary: Add version information to driver log
 Key: SPARK-4870
 URL: https://issues.apache.org/jira/browse/SPARK-4870
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zhang, Liye


The driver log doesn't include the Spark version. Version information is important 
when testing different Spark versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4870) Add version information to driver log

2014-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249519#comment-14249519
 ] 

Apache Spark commented on SPARK-4870:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/3717

 Add version information to driver log
 -

 Key: SPARK-4870
 URL: https://issues.apache.org/jira/browse/SPARK-4870
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Zhang, Liye

 The driver log doesn't include the Spark version. Version information is 
 important when testing different Spark versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4868) Twitter DStream.map() throws Task not serializable

2014-12-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249582#comment-14249582
 ] 

Nicholas Chammas commented on SPARK-4868:
-

Replacing {{noop()}} with this

{code}
object Test {
def noop(a: Any): Any = {
  a
}
}
{code}

and calling {{liveTweets.map(t => Test.noop(t)).print()}} yields a similar 
stack trace.

Creating the streaming context as follows

{code}
@transient val ssc = new StreamingContext(sc, Seconds(5))
{code}

and trying either the original {{noop()}} or {{Test.noop()}} (with REPL 
restarts in between) yields a slightly more interesting trace:

{code}
scala> liveTweets.map(t => Test.noop(t)).print()  // exception here
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1264)
at org.apache.spark.streaming.dstream.DStream.map(DStream.scala:438)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
at $iwC$$iwC$$iwC.<init>(<console>:32)
at $iwC$$iwC.<init>(<console>:34)
at $iwC.<init>(<console>:36)
at <init>(<console>:38)
at .<init>(<console>:42)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:823)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:868)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:780)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:625)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:633)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:638)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:963)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:911)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:911)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1006)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:329)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: Object of 
org.apache.spark.streaming.twitter.TwitterInputDStream is being serialized  
possibly as a part of closure of an RDD operation. This is because  the DStream 
object is being referred to from within the closure.  Please rewrite the RDD 
operation inside this DStream to avoid this.  This has been enforced to avoid 
bloating of Spark tasks  with unnecessary objects.
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$writeObject$1.apply$mcV$sp(DStream.scala:416)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:927)
at 
org.apache.spark.streaming.dstream.DStream.writeObject(DStream.scala:403)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at 

[jira] [Created] (SPARK-4872) Provide sample format of training/test data in MLlib programming guide

2014-12-16 Thread zhang jun wei (JIRA)
zhang jun wei created SPARK-4872:


 Summary: Provide sample format of training/test data in MLlib 
programming guide
 Key: SPARK-4872
 URL: https://issues.apache.org/jira/browse/SPARK-4872
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.1.1
Reporter: zhang jun wei


I suggest that the samples in the online MLlib programming guide use real-life 
data and show the translated data format for the model to consume. 

The problem blocking me is how to translate real-life data into a format that 
MLlib can understand correctly. 

Here is one sample: I want to use NaiveBayes to train on and predict a tennis-play 
decision. The original data is:
Weather | Temperature | Humidity | Wind => Decision to play tennis
Sunny   | Hot         | High     | No   => No
Sunny   | Hot         | High     | Yes  => No
Cloudy  | Normal      | Normal   | No   => Yes
Rainy   | Cold        | Normal   | Yes  => No

Now, from my understanding, one potential translation is:
1) put every feature value word into a line:
Sunny Cloudy Rainy Hot Normal Cold High Normal Yes No
2) map them to numbers:
1 2 3 4 5 6 7 8 9 10
3) map decision labels to numbers:
0 - No
1 - Yes
4) set the value to 1 if it appears, or 0 if not. For the above example, here 
is the data format for MLUtils.loadLibSVMFile to use:
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:0 10:1
0 1:1 2:0 3:0 4:1 5:0 6:0 7:1 8:0 9:1 10:0
1 1:0 2:1 3:0 4:0 5:1 6:0 7:0 8:1 9:0 10:1
0 1:0 2:0 3:1 4:0 5:0 6:1 7:0 8:1 9:1 10:0
==> Is this a correct understanding?

And another way I can imagine is:
1) put every feature name into a line:
Weather  Temperature  Humidity  Wind
2) map them to numbers:
1 2 3 4 
3) map decision labels to numbers:
0 - No
1 - Yes
4) map each value of each feature to a number (e.g. Sunny to 1, Cloudy to 2, 
Rainy to 3; Hot to 1, Normal to 2, Cold to 3; High to 1, Normal to 2; Yes to 1, 
No to 2). For the above example, here is the data format for 
MLUtils.loadLibSVMFile to use:
0 1:1 2:1 3:1 4:2
0 1:1 2:1 3:1 4:1
1 1:2 2:2 3:2 4:2
0 1:3 2:3 3:2 4:1
==> but when I read the source code in NaiveBayes.scala, it seems this is not 
correct; I am not sure though...

So which data format translation way is correct?
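
Without claiming which translation the guide intends, here is a hedged sketch of the 
first (one-hot) encoding expressed directly as LabeledPoints, so a libsvm file can be 
checked against it. The indices are the ones from the description, shifted to 
zero-based because MLlib vectors are zero-indexed:

{code}
// Hedged sketch, not an authoritative answer to the question above.
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Feature slots (zero-based): 0 Sunny, 1 Cloudy, 2 Rainy, 3 Hot, 4 Normal(temp),
// 5 Cold, 6 High, 7 Normal(humidity), 8 Yes(wind), 9 No(wind). Label: 0 = No, 1 = Yes.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.sparse(10, Seq((0, 1.0), (3, 1.0), (6, 1.0), (9, 1.0)))),
  LabeledPoint(0.0, Vectors.sparse(10, Seq((0, 1.0), (3, 1.0), (6, 1.0), (8, 1.0)))),
  LabeledPoint(1.0, Vectors.sparse(10, Seq((1, 1.0), (4, 1.0), (7, 1.0), (9, 1.0)))),
  LabeledPoint(0.0, Vectors.sparse(10, Seq((2, 1.0), (5, 1.0), (7, 1.0), (8, 1.0))))
))

val model = NaiveBayes.train(data, lambda = 1.0)
{code}

Loading the same rows from a file with MLUtils.loadLibSVMFile(sc, path) should produce 
an equivalent RDD[LabeledPoint], keeping in mind that indices in a libsvm file are 
one-based while the resulting vectors are zero-based.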



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3541) Improve ALS internal storage

2014-12-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249578#comment-14249578
 ] 

Apache Spark commented on SPARK-3541:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/3720

 Improve ALS internal storage
 

 Key: SPARK-3541
 URL: https://issues.apache.org/jira/browse/SPARK-3541
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
   Original Estimate: 96h
  Remaining Estimate: 96h

 The internal storage of ALS uses many small objects, which increases GC 
 pressure and makes ALS difficult to scale to very large datasets, e.g., 50 
 billion ratings. In such cases, a full GC may take more than 10 minutes to 
 finish. That is longer than the default heartbeat timeout, and hence executors 
 will be removed under default settings.
 We can use primitive arrays to reduce the number of objects significantly. 
 This requires a big change to the ALS implementation.
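 As a hedged illustration of the idea (not the actual patch), packing a block of 
 factor vectors into one flat primitive array replaces many small per-user objects 
 with a single array, which is what relieves the GC:
 {code}
 // Hedged sketch: factors for numUsers users of rank `rank` live in one
 // Array[Float] of length numUsers * rank instead of numUsers separate vectors.
 class FactorBlock(rank: Int, numUsers: Int) {
   private val data = new Array[Float](rank * numUsers)

   // Copy one factor into a caller-provided buffer to avoid per-call allocation.
   def get(user: Int, out: Array[Float]): Unit =
     System.arraycopy(data, user * rank, out, 0, rank)

   def set(user: Int, factor: Array[Float]): Unit =
     System.arraycopy(factor, 0, data, user * rank, rank)
 }
 {code}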



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org