[jira] [Assigned] (SPARK-12545) Support exists condition

2016-01-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12545:
--

Assignee: Davies Liu

> Support exists condition
> 
>
> Key: SPARK-12545
> URL: https://issues.apache.org/jira/browse/SPARK-12545
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
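
For context, a minimal sketch of the kind of EXISTS subquery this feature concerns. This is a hypothetical illustration only: the table names are made up, and this query form is exactly what Spark SQL does not yet accept here.

{code:java}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Hypothetical illustration: an EXISTS predicate over two made-up temp tables.
// This is the shape of query the ticket asks Spark SQL to support.
public class ExistsExample {
  public static DataFrame ordersWithCustomers(SQLContext sqlContext) {
    return sqlContext.sql(
        "SELECT * FROM orders o " +
        "WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id)");
  }
}
{code}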




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12734) Fix Netty exclusions and use Maven Enforcer to prevent bug from being reintroduced

2016-01-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12734:
---
Fix Version/s: 1.6.1

> Fix Netty exclusions and use Maven Enforcer to prevent bug from being 
> reintroduced
> --
>
> Key: SPARK-12734
> URL: https://issues.apache.org/jira/browse/SPARK-12734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.1, 2.0.0
>
>
> Netty classes are published under artifacts with different names, so our 
> build needs to exclude the {{io.netty:netty}} and {{org.jboss.netty:netty}} 
> versions of the Netty artifact. However, our existing exclusions were 
> incomplete, leading to situations where duplicate Netty classes would wind up 
> on the classpath and cause compile errors (or worse).
> We should fix this and should also start using Maven Enforcer's dependency 
> banning mechanisms to prevent this problem from ever being reintroduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12734) Fix Netty exclusions and use Maven Enforcer to prevent bug from being reintroduced

2016-01-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12734:
---
Description: 
Netty classes are published under artifacts with different names, so our build 
needs to exclude the {{org.jboss.netty:netty}} versions of the Netty artifact. 
However, our existing exclusions were incomplete, leading to situations where 
duplicate Netty classes would wind up on the classpath and cause compile errors 
(or worse).

We should fix this and should also start using Maven Enforcer's dependency 
banning mechanisms to prevent this problem from ever being reintroduced.

  was:
Netty classes are published under artifacts with different names, so our build 
needs to exclude the {{{org.jboss.netty:netty}} versions of the Netty artifact. 
However, our existing exclusions were incomplete, leading to situations where 
duplicate Netty classes would wind up on the classpath and cause compile errors 
(or worse).

We should fix this and should also start using Maven Enforcer's dependency 
banning mechanisms to prevent this problem from ever being reintroduced.


> Fix Netty exclusions and use Maven Enforcer to prevent bug from being 
> reintroduced
> --
>
> Key: SPARK-12734
> URL: https://issues.apache.org/jira/browse/SPARK-12734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.1, 2.0.0
>
>
> Netty classes are published under artifacts with different names, so our 
> build needs to exclude the {{org.jboss.netty:netty}} versions of the Netty 
> artifact. However, our existing exclusions were incomplete, leading to 
> situations where duplicate Netty classes would wind up on the classpath and 
> cause compile errors (or worse).
> We should fix this and should also start using Maven Enforcer's dependency 
> banning mechanisms to prevent this problem from ever being reintroduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12734) Fix Netty exclusions and use Maven Enforcer to prevent bug from being reintroduced

2016-01-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12734:
---
Description: 
Netty classes are published under artifacts with different names, so our build 
needs to exclude the {{{org.jboss.netty:netty}} versions of the Netty artifact. 
However, our existing exclusions were incomplete, leading to situations where 
duplicate Netty classes would wind up on the classpath and cause compile errors 
(or worse).

We should fix this and should also start using Maven Enforcer's dependency 
banning mechanisms to prevent this problem from ever being reintroduced.

  was:
Netty classes are published under artifacts with different names, so our build 
needs to exclude the {{io.netty:netty}} and {{org.jboss.netty:netty}} versions 
of the Netty artifact. However, our existing exclusions were incomplete, 
leading to situations where duplicate Netty classes would wind up on the 
classpath and cause compile errors (or worse).

We should fix this and should also start using Maven Enforcer's dependency 
banning mechanisms to prevent this problem from ever being reintroduced.


> Fix Netty exclusions and use Maven Enforcer to prevent bug from being 
> reintroduced
> --
>
> Key: SPARK-12734
> URL: https://issues.apache.org/jira/browse/SPARK-12734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.1, 2.0.0
>
>
> Netty classes are published under artifacts with different names, so our 
> build needs to exclude the {{{org.jboss.netty:netty}} versions of the Netty 
> artifact. However, our existing exclusions were incomplete, leading to 
> situations where duplicate Netty classes would wind up on the classpath and 
> cause compile errors (or worse).
> We should fix this and should also start using Maven Enforcer's dependency 
> banning mechanisms to prevent this problem from ever being reintroduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12734) Fix Netty exclusions and use Maven Enforcer to prevent bug from being reintroduced

2016-01-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12734:
---
Fix Version/s: 1.5.3

> Fix Netty exclusions and use Maven Enforcer to prevent bug from being 
> reintroduced
> --
>
> Key: SPARK-12734
> URL: https://issues.apache.org/jira/browse/SPARK-12734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
>
> Netty classes are published under artifacts with different names, so our 
> build needs to exclude the {{org.jboss.netty:netty}} versions of the Netty 
> artifact. However, our existing exclusions were incomplete, leading to 
> situations where duplicate Netty classes would wind up on the classpath and 
> cause compile errors (or worse).
> We should fix this and should also start using Maven Enforcer's dependency 
> banning mechanisms to prevent this problem from ever being reintroduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12749) Spark SQL JSON schema inference should allow parsing numbers as BigDecimal

2016-01-11 Thread Brandon Bradley (JIRA)
Brandon Bradley created SPARK-12749:
---

 Summary: Spark SQL JSON schema inference should allow parsing 
numbers as BigDecimal
 Key: SPARK-12749
 URL: https://issues.apache.org/jira/browse/SPARK-12749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Brandon Bradley
Priority: Minor


Hello,

Spark SQL JSON parsing has no options for producing BigDecimal from 
floating-point values. I think an option should be added to 
`org.apache.spark.sql.execution.datasources.json.JSONOptions` to allow for 
this. I can provide a patch if an option name can be agreed upon.

I'm thinking something like `floatAsBigDecimal`.

Cheers!
Brandon
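
A hedged sketch of how the proposed option might be used from the Java API. The option name floatAsBigDecimal is only the suggestion above, not an existing JSONOptions setting, and the path is a placeholder.

{code:java}
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Sketch under the assumption that the proposed "floatAsBigDecimal" option exists;
// "prices.json" is a made-up path.
public class BigDecimalJsonSketch {
  public static DataFrame readPrices(SQLContext sqlContext) {
    DataFrame df = sqlContext.read()
        .option("floatAsBigDecimal", "true")
        .json("prices.json");
    df.printSchema(); // floating-point fields would infer as decimal rather than double
    return df;
  }
}
{code}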



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12750) Java class method doesn't work properly

2016-01-11 Thread Gramce (JIRA)
Gramce created SPARK-12750:
--

 Summary: Java class method doesn't work properly
 Key: SPARK-12750
 URL: https://issues.apache.org/jira/browse/SPARK-12750
 Project: Spark
  Issue Type: Question
Reporter: Gramce


I use Java Spark to transform LabeledPoint data.
I want to select several columns from the JavaRDD, for example 
the first three columns.
So I wrote this:

int[] ad={1,2,3};
int b=ad.length;
JavaRDD<LabeledPoint> ggd=parsedData.map(
new Function<LabeledPoint, LabeledPoint>(){
public LabeledPoint call(LabeledPoint a){
double[] v =new double[b];
for(int i=0;i<b;i++){
v[i]=a.features().toArray()[ad[i]];
}
return new LabeledPoint(a.label(),Vectors.dense(v));
}
});

where parsedData is LabeledPoint data.
Now I want to convert this to a method. So the code is like this:

class myrddd{
public JavaRDD<LabeledPoint> abcd;
public myrddd(JavaRDD<LabeledPoint> deff ){
abcd=deff;
}
public JavaRDD<LabeledPoint> abcdf(int[] asdf,int b){
JavaRDD<LabeledPoint> bcd=abcd;
JavaRDD<LabeledPoint> mms=bcd.map(
new Function<LabeledPoint, LabeledPoint>(){
public LabeledPoint call(LabeledPoint a){
double[] v =new double[b];
for(int i=0;i<b;i++){
v[i]=a.features().toArray()[asdf[i]];
}
return new LabeledPoint(a.label(),Vectors.dense(v));
}
});
return(mms);}
}

And:

myrddd ndfs=new myrddd(parsedData);
JavaRDD<LabeledPoint> ggdf=ndfs.abcdf(ad, b);

But this doesn't work. Following is the error:

Exception in thread "main" org.apache.spark.SparkException: Task not 
serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
at org.apache.spark.rdd.RDD.map(RDD.scala:317)
at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
at 
org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:47)
at anbv.qwe.myrddd.abcdf(dfa.java:53)
at anbv.qwe.dfa.main(dfa.java:42)
Caused by: java.io.NotSerializableException: anbv.qwe.myrddd
Serialization stack:
- object not serializable (class: anbv.qwe.myrddd, value: 
anbv.qwe.myrddd@310aee0b)
- field (class: anbv.qwe.myrddd$1, name: this$0, type: class 
anbv.qwe.myrddd)
- object (class anbv.qwe.myrddd$1, anbv.qwe.myrddd$1@4b76aa5a)
- field (class: 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: fun$1, 
type: interface org.apache.spark.api.java.function.Function)
- object (class 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at 
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 13 more
but this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12749) Spark SQL JSON schema inference should allow floating-point numbers as BigDecimal

2016-01-11 Thread Brandon Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Bradley updated SPARK-12749:

Summary: Spark SQL JSON schema inference should allow floating-point numbers 
as BigDecimal  (was: Spark SQL JSON schema inference should allow parsing 
numbers as BigDecimal)

> Spark SQL JSON schema inference should allow floating-point numbers as 
> BigDecimal
> 
>
> Key: SPARK-12749
> URL: https://issues.apache.org/jira/browse/SPARK-12749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Brandon Bradley
>Priority: Minor
>
> Hello,
> Spark SQL JSON parsing has no options for producing BigDecimal from 
> floating-point values. I think an option should be added to 
> `org.apache.spark.sql.execution.datasources.json.JSONOptions` to allow for 
> this. I can provide a patch if an option name can be agreed upon.
> I'm thinking something like `floatAsBigDecimal`.
> Cheers!
> Brandon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12749) Spark SQL JSON schema inference should allow floating-point numbers as BigDecimal

2016-01-11 Thread Brandon Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Bradley updated SPARK-12749:

Description: 
Hello,

Spark SQL JSON parsing has no options for producing BigDecimal from 
floating-point values. I think an option should be added to 
org.apache.spark.sql.execution.datasources.json.JSONOptions to allow for this. 
I can provide a patch if an option name can be agreed upon.

I'm thinking something like floatAsBigDecimal.

Cheers!
Brandon

  was:
Hello,

Spark SQL JSON parsing has no options for producing BigDecimal from 
floating-point values. I think an option should be added to 
`org.apache.spark.sql.execution.datasources.json.JSONOptions` to allow for 
this. I can provide a patch if an option name can be agreed upon.

I'm thinking something like `floatAsBigDecimal`.

Cheers!
Brandon


> Spark SQL JSON schema inference should allow floating-point numbers as 
> BigDecimal
> 
>
> Key: SPARK-12749
> URL: https://issues.apache.org/jira/browse/SPARK-12749
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Brandon Bradley
>Priority: Minor
>
> Hello,
> Spark SQL JSON parsing has no options for producing BigDecimal from 
> floating-point values. I think an option should be added to 
> org.apache.spark.sql.execution.datasources.json.JSONOptions to allow for 
> this. I can provide a patch if an option name can be agreed upon.
> I'm thinking something like floatAsBigDecimal.
> Cheers!
> Brandon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12260) Graceful Shutdown with In-Memory State

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12260.
---
Resolution: Won't Fix

> Graceful Shutdown with In-Memory State
> --
>
> Key: SPARK-12260
> URL: https://issues.apache.org/jira/browse/SPARK-12260
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mao, Wei
>  Labels: streaming
>
> Users often stop and restart their streaming jobs for tasks such as 
> maintenance, software upgrades or even application logic updates. When a job 
> re-starts it should pick up where it left off i.e. any state information that 
> existed when the job stopped should be used as the initial state when the job 
> restarts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12750) Java class method doesn't work properly

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12750.
---
Resolution: Not A Problem

The problem is as it says. You've written Java code such that your Function 
retains a reference to a non-serializable object.
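
For illustration, a minimal sketch of one way to avoid that capture: move the function into its own top-level (or static nested) class so it only closes over the serializable values it needs. Class and method names below are illustrative, not from the report.

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// Sketch: a standalone, serializable Function that holds no reference to any
// enclosing non-serializable object, only the int[] of column indices it needs.
public class ColumnSelector implements Function<LabeledPoint, LabeledPoint> {
  private final int[] cols;

  public ColumnSelector(int[] cols) { this.cols = cols; }

  @Override
  public LabeledPoint call(LabeledPoint p) {
    double[] v = new double[cols.length];
    for (int i = 0; i < cols.length; i++) {
      v[i] = p.features().toArray()[cols[i]];
    }
    return new LabeledPoint(p.label(), Vectors.dense(v));
  }

  // Usage: JavaRDD<LabeledPoint> selected = select(parsedData, new int[]{1, 2, 3});
  public static JavaRDD<LabeledPoint> select(JavaRDD<LabeledPoint> rdd, int[] cols) {
    return rdd.map(new ColumnSelector(cols));
  }
}
{code}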

> Java class method doesn't work properly
> -
>
> Key: SPARK-12750
> URL: https://issues.apache.org/jira/browse/SPARK-12750
> Project: Spark
>  Issue Type: Question
>Reporter: Gramce
>
> I use Java Spark to transform LabeledPoint data.
> I want to select several columns from the JavaRDD, for example 
> the first three columns.
> So I wrote this:
> int[] ad={1,2,3};
> int b=ad.length;
> JavaRDD<LabeledPoint> ggd=parsedData.map(
>   new Function<LabeledPoint, LabeledPoint>(){
>   public LabeledPoint call(LabeledPoint a){
>   double[] v =new double[b];
>   for(int i=0;i<b;i++){
>   v[i]=a.features().toArray()[ad[i]];
>   }
>   return new LabeledPoint(a.label(),Vectors.dense(v));
>   }
>   });
> where parsedData is LabeledPoint data.
> Now I want to convert this to a method. So the code is like this:
> class myrddd{
>   public JavaRDD<LabeledPoint> abcd;
>   public myrddd(JavaRDD<LabeledPoint> deff ){
>   abcd=deff;
>   }
>   public JavaRDD<LabeledPoint> abcdf(int[] asdf,int b){
>   JavaRDD<LabeledPoint> bcd=abcd;
>   JavaRDD<LabeledPoint> mms=bcd.map(
>   new Function<LabeledPoint, LabeledPoint>(){
>   public LabeledPoint call(LabeledPoint a){
>   double[] v =new double[b];
>   for(int i=0;i<b;i++){
>   v[i]=a.features().toArray()[asdf[i]];
>   }
>   return new LabeledPoint(a.label(),Vectors.dense(v));
>   }
>   });
>   return(mms);}
> }
> And:
> myrddd ndfs=new myrddd(parsedData);
> JavaRDD<LabeledPoint> ggdf=ndfs.abcdf(ad, b);
> But this doesn't work. Following is the error:
> Exception in thread "main" org.apache.spark.SparkException: Task not 
> serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:318)
>   at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:317)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:317)
>   at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:93)
>   at 
> org.apache.spark.api.java.AbstractJavaRDDLike.map(JavaRDDLike.scala:47)
>   at anbv.qwe.myrddd.abcdf(dfa.java:53)
>   at anbv.qwe.dfa.main(dfa.java:42)
> Caused by: java.io.NotSerializableException: anbv.qwe.myrddd
> Serialization stack:
>   - object not serializable (class: anbv.qwe.myrddd, value: 
> anbv.qwe.myrddd@310aee0b)
>   - field (class: anbv.qwe.myrddd$1, name: this$0, type: class 
> anbv.qwe.myrddd)
>   - object (class anbv.qwe.myrddd$1, anbv.qwe.myrddd$1@4b76aa5a)
>   - field (class: 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, name: 
> fun$1, type: interface org.apache.spark.api.java.function.Function)
>   - object (class 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1, )
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
>   ... 13 more
> but this



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12642) improve the hash expression to be decoupled from unsafe row

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091659#comment-15091659
 ] 

Apache Spark commented on SPARK-12642:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10694

> improve the hash expression to be decoupled from unsafe row 
> 
>
> Key: SPARK-12642
> URL: https://issues.apache.org/jira/browse/SPARK-12642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> For example, if the row is (int, double, string): the generated hash function 
> should be something like
> int hash = seed;
> hash = murmur3(getInt(0), hash)
> hash = murmur3(getDouble(1), hash)
> hash = murmur3(getString(2), hash)
> return hash
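
Read as a hedged sketch of the proposed shape rather than Spark's actual generated code: hash field by field, threading the seed through each step, with no dependence on the UnsafeRow binary layout. The mix() helper below is a stand-in combiner, not Spark's Murmur3 implementation.

{code:java}
import org.apache.spark.sql.Row;

// Sketch of per-field hashing for a (int, double, string) row, decoupled from
// UnsafeRow. mix() is a placeholder for a murmur3-style primitive.
public class RowHashSketch {
  static int mix(int value, int seed) {
    int h = seed ^ (value * 0xcc9e2d51);
    return Integer.rotateLeft(h, 13) * 5 + 0xe6546b64;
  }

  public static int hashRow(Row row, int seed) {
    int hash = seed;
    hash = mix(row.getInt(0), hash);
    long bits = Double.doubleToLongBits(row.getDouble(1));
    hash = mix((int) (bits ^ (bits >>> 32)), hash);
    hash = mix(row.getString(2).hashCode(), hash);
    return hash;
  }
}
{code}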



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12642) improve the hash expression to be decoupled from unsafe row

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12642:


Assignee: Apache Spark

> improve the hash expression to be decoupled from unsafe row 
> 
>
> Key: SPARK-12642
> URL: https://issues.apache.org/jira/browse/SPARK-12642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>
> For example, if the row is (int, double, string): the generated hash function 
> should be something like
> int hash = seed;
> hash = murmur3(getInt(0), hash)
> hash = murmur3(getDouble(1), hash)
> hash = murmur3(getString(2), hash)
> return hash



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12642) improve the hash expression to be decoupled from unsafe row

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12642:


Assignee: (was: Apache Spark)

> improve the hash expression to be decoupled from unsafe row 
> 
>
> Key: SPARK-12642
> URL: https://issues.apache.org/jira/browse/SPARK-12642
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> For example, if the row is (int, double, string): the generated hash function 
> should be something like
> int hash = seed;
> hash = murmur3(getInt(0), hash)
> hash = murmur3(getDouble(1), hash)
> hash = murmur3(getString(2), hash)
> return hash



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12269) Update aws-java-sdk version

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12269.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10256
[https://github.com/apache/spark/pull/10256]

> Update aws-java-sdk version
> ---
>
> Key: SPARK-12269
> URL: https://issues.apache.org/jira/browse/SPARK-12269
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Brian London
>Assignee: Brian London
>Priority: Minor
> Fix For: 2.0.0
>
>
> The current Spark Streaming kinesis connector references a quite old version 
> 1.9.40 of the AWS Java SDK (1.10.40 is current).  Numerous AWS features 
> including Kinesis Firehose are unavailable in 1.9.  Those two versions of  
> the AWS SDK in turn require conflicting versions of Jackson (2.4.4 and 2.5.3 
> respectively) such that one cannot include the current AWS SDK in a project 
> that also uses the Spark Streaming Kinesis ASL.
> Bumping the version of Jackson and the AWS library solves this problem and 
> will allow Firehose integrations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3369:
-
Attachment: (was: FlatMapIterator.patch)

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -
>
> Key: SPARK-3369
> URL: https://issues.apache.org/jira/browse/SPARK-3369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API
>Affects Versions: 1.0.2, 1.2.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>  Labels: breaking_change, releasenotes
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
> an {{Iterator}} but is required to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistency 
> though because this seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarly for other {{mapPartitions*}} methods and other 
> {{*FlatMapFunction}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was inadvertent inconsistency, then the big issue here 
> is that of course this is part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.
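
For concreteness, a minimal sketch of the "hacky IteratorIterable" workaround mentioned above. It is only safe under the promise that Spark calls iterator() exactly once, which is precisely the contract question the ticket raises.

{code:java}
import java.util.Iterator;

// Sketch: an Iterable that hands back its single underlying Iterator. It breaks
// the usual Iterable contract (it cannot be iterated twice), hence "hacky".
public class IteratorIterable<T> implements Iterable<T> {
  private final Iterator<T> iterator;

  public IteratorIterable(Iterator<T> iterator) { this.iterator = iterator; }

  @Override
  public Iterator<T> iterator() { return iterator; }
}
{code}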



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3369:
-
Comment: was deleted

(was: Attaching only a patch for now, not a PR, to demonstrate the extent of 
the change if Iterable were to be directly changed to Iterator.

See also 
https://github.com/srowen/spark/commit/496b84ad47052af10d1d6055d45ff8782f502b59)

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -
>
> Key: SPARK-3369
> URL: https://issues.apache.org/jira/browse/SPARK-3369
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API
>Affects Versions: 1.0.2, 1.2.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>  Labels: breaking_change, releasenotes
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
> an {{Iterator}} but is required to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistency 
> though because this seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarly for other {{mapPartitions*}} methods and other 
> {{*FlatMapFunction}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was inadvertent inconsistency, then the big issue here 
> is that of course this is part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4970) Do not read spark.executor.memory from spark-defaults.conf in SparkSubmitSuite

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4970.
--
Resolution: Not A Problem

> Do not read spark.executor.memory from spark-defaults.conf in SparkSubmitSuite
> --
>
> Key: SPARK-4970
> URL: https://issues.apache.org/jira/browse/SPARK-4970
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The test 'includes jars passed in through --jars' in SparkSubmitSuite fails
> when spark.executor.memory is set at over 512MiB in conf/spark-default.conf.
> An exception is thrown as follows:
> Exception in thread "main" org.apache.spark.SparkException: Asked to launch 
> cluster with 512 MB RAM / worker but requested 1024 MB/worker
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:1889)
>   at org.apache.spark.SparkContext.(SparkContext.scala:322)
>   at 
> org.apache.spark.deploy.JarCreationTest$.main(SparkSubmitSuite.scala:458)
>   at org.apache.spark.deploy.JarCreationTest.main(SparkSubmitSuite.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:367)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9825) Spark overwrites remote cluster "final" properties with local config

2016-01-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091687#comment-15091687
 ] 

Sean Owen commented on SPARK-9825:
--

I don't believe Spark modifies any of these settings. Is that even possible in 
the {{Configuration}} object? It is, however, possible that something somewhere 
is managing to create a config that somehow lacks the defaults configured in 
these files.

> Spark overwrites remote cluster "final" properties with local config 
> -
>
> Key: SPARK-9825
> URL: https://issues.apache.org/jira/browse/SPARK-9825
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Rok Roskar
>
> Configuration options specified in the hadoop cluster *.xml config files can 
> be marked as "final", indicating that they should not be overwritten by a 
> client's configuration. Spark appears to be over-writing those options, the 
> symptom of which is that local proxy settings overwrite the cluster-side 
> proxy settings. This breaks things when trying to run jobs on a remote, 
> firewalled, YARN cluster. 
> For example, with the configuration below, one should be able to establish a 
> SOCKS proxy via ssh -D to a host that can "see" the cluster, and then submit 
> jobs and run the driver on the local desktop/laptop:
> Remote cluster-side core-site.xml:
> {code:xml}
> <property>
>   <name>hadoop.rpc.socket.factory.class.default</name>
>   <value>org.apache.hadoop.net.StandardSocketFactory</value>
>   <final>true</final>
> </property>
> {code}
> This configuration ensures that the nodes within the cluster never use a 
> proxy to talk to each other.
> Local client-side core-site.xml:
> {code:xml}
> <property>
>   <name>hadoop.rpc.socket.factory.class.default</name>
>   <value>org.apache.hadoop.net.SocksSocketFactory</value>
> </property>
> <property>
>   <name>hadoop.socks.server</name>
>   <value>localhost:</value>
> </property>
> {code}
> Indeed, running a standard MapReduce job, the log files show that an override 
> of a property marked {{final}} is attempted: 
> {code}
> 2015-07-27 15:26:11,706 WARN [main] org.apache.hadoop.conf.Configuration: 
> job.xml:an attempt to override final parameter: 
> hadoop.rpc.socket.factory.class.default;  Ignoring.
> {code}
> and the MR job proceeds and finishes normally. 
> On the other hand, a Spark job with the same configuration shows no such 
> message and instead we see that the nodes within the cluster are not able to 
> communicate: 
> {code}
> 15/07/27 15:25:43 INFO client.RMProxy: Connecting to ResourceManager at 
> node1/10.211.55.101:8030
> 15/07/27 15:25:43 INFO yarn.YarnRMClient: Registering the ApplicationMaster
> 15/07/27 15:25:44 INFO ipc.Client: Retrying connect to server: 
> node1/10.211.55.101:8030. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> {code}
> Running tcpdump on the slave nodes shows that in the case of the MR job, 
> packets are sent between slave nodes and the ResourceManager node  indicating 
> that no proxy is being used, while in the case of the Spark job no such 
> connection is made. 
> A further indication that the cluster-side configuration is altered is that 
> if a dedicated proxy server is set up in a way that both sides can see it, 
> i.e. the local core-site.xml is changed to have
> {code:xml}
> <property>
>   <name>hadoop.socks.server</name>
>   <value>node2:</value>
> </property>
> {code}
> the Spark job (and the MR job) run fine, with all connections going through 
> the dedicated proxy server. While this works, it's sub-optimal because it now 
> requires that such a server be created, which may not always be possible 
> because it requires privileged access to the gateway machine. 
> Therefore, it appears that Spark is perfectly happy running through a proxy 
> in YARN mode, but that it garbles the cluster-side configuration even when 
> properties are marked as {{}}. I'm not sure if this is intended? Or is 
> there some other way that preserving the "final" properties can be enforced?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12743) spark.executor.memory is ignored by spark-submit in Standalone Cluster mode

2016-01-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091702#comment-15091702
 ] 

Sean Owen commented on SPARK-12743:
---

I can't reproduce this; I started a simple master/executor on my local machine 
and ran spark-submit as you suggest and found I got exactly the executor memory 
I requested. I asked for 1g whereas my local conf/spark-defaults.conf specified 
2g. What else might be different / are you sure?

> spark.executor.memory is ignored by spark-submit in Standalone Cluster mode
> ---
>
> Key: SPARK-12743
> URL: https://issues.apache.org/jira/browse/SPARK-12743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
>Reporter: Alan Braithwaite
>
> When using spark-submit in standalone cluster mode, `--conf 
> spark.executor.memory=Xg` is ignored.  Instead, the value in 
> spark-defaults.conf on the standalone master is used.
> We're using the legacy submission gateway as well, if that affects this (we're 
> in the process of setting up the REST gateway).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12746) ArrayType(_, true) should also accept ArrayType(_, false)

2016-01-11 Thread Earthson Lu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Earthson Lu updated SPARK-12746:

Description: 
I see CountVectorizer has schema check for ArrayType which has 
ArrayType(StringType, true). 

ArrayType(String, false) is just a special case of ArrayType(String, true), but 
it will not pass this type check.

  was:
I see CountVectorizer has schema check for ArrayType which has 
ArrayType(StringType, true). 

ArrayType(String, false) is just a special case of ArrayType(String, false), 
but it will not pass this type check.


> ArrayType(_, true) should also accept ArrayType(_, false)
> -
>
> Key: SPARK-12746
> URL: https://issues.apache.org/jira/browse/SPARK-12746
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SQL
>Affects Versions: 1.6.0
>Reporter: Earthson Lu
>
> I see CountVectorizer has schema check for ArrayType which has 
> ArrayType(StringType, true). 
> ArrayType(String, false) is just a special case of ArrayType(String, true), 
> but it will not pass this type check.
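
A hedged sketch of a containsNull-tolerant check, which is roughly what the ticket asks the schema validation to do; the class and method names are illustrative, not CountVectorizer's actual code.

{code:java}
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;

// Sketch: accept both ArrayType(StringType, true) and ArrayType(StringType, false)
// by checking only the element type and ignoring the containsNull flag.
public class SchemaCheckSketch {
  public static boolean isStringArray(DataType dt) {
    return dt instanceof ArrayType
        && DataTypes.StringType.equals(((ArrayType) dt).elementType());
  }
}
{code}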



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9825) Spark overwrites remote cluster "final" properties with local config

2016-01-11 Thread Rok Roskar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091718#comment-15091718
 ] 

Rok Roskar commented on SPARK-9825:
---

I'm not sure who has the responsibility to honor the "final" property flag -- 
the client or the cluster side? If the "final" designation is ignored, it has 
the potential to be problematic in general, not just in this use case. 

> Spark overwrites remote cluster "final" properties with local config 
> -
>
> Key: SPARK-9825
> URL: https://issues.apache.org/jira/browse/SPARK-9825
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Rok Roskar
>
> Configuration options specified in the hadoop cluster *.xml config files can 
> be marked as "final", indicating that they should not be overwritten by a 
> client's configuration. Spark appears to be over-writing those options, the 
> symptom of which is that local proxy settings overwrite the cluster-side 
> proxy settings. This breaks things when trying to run jobs on a remote, 
> firewalled, YARN cluster. 
> For example, with the configuration below, one should be able to establish a 
> SOCKS proxy via ssh -D to a host that can "see" the cluster, and then submit 
> jobs and run the driver on the local desktop/laptop:
> Remote cluster-side core-site.xml:
> {code:xml}
> <property>
>   <name>hadoop.rpc.socket.factory.class.default</name>
>   <value>org.apache.hadoop.net.StandardSocketFactory</value>
>   <final>true</final>
> </property>
> {code}
> This configuration ensures that the nodes within the cluster never use a 
> proxy to talk to each other.
> Local client-side core-site.xml:
> {code:xml}
> <property>
>   <name>hadoop.rpc.socket.factory.class.default</name>
>   <value>org.apache.hadoop.net.SocksSocketFactory</value>
> </property>
> <property>
>   <name>hadoop.socks.server</name>
>   <value>localhost:</value>
> </property>
> {code}
> Indeed, running a standard MapReduce job, the log files show that an override 
> of a property marked {{final}} is attempted: 
> {code}
> 2015-07-27 15:26:11,706 WARN [main] org.apache.hadoop.conf.Configuration: 
> job.xml:an attempt to override final parameter: 
> hadoop.rpc.socket.factory.class.default;  Ignoring.
> {code}
> and the MR job proceeds and finishes normally. 
> On the other hand, a Spark job with the same configuration shows no such 
> message and instead we see that the nodes within the cluster are not able to 
> communicate: 
> {code}
> 15/07/27 15:25:43 INFO client.RMProxy: Connecting to ResourceManager at 
> node1/10.211.55.101:8030
> 15/07/27 15:25:43 INFO yarn.YarnRMClient: Registering the ApplicationMaster
> 15/07/27 15:25:44 INFO ipc.Client: Retrying connect to server: 
> node1/10.211.55.101:8030. Already tried 0 time(s); retry policy is 
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
> MILLISECONDS)
> {code}
> Running tcpdump on the slave nodes shows that in the case of the MR job, 
> packets are sent between slave nodes and the ResourceManager node  indicating 
> that no proxy is being used, while in the case of the Spark job no such 
> connection is made. 
> A further indication that the cluster-side configuration is altered is that 
> if a dedicated proxy server is set up in a way that both sides can see it, 
> i.e. the local core-site.xml is changed to have
> {code:xml}
> <property>
>   <name>hadoop.socks.server</name>
>   <value>node2:</value>
> </property>
> {code}
> the Spark job (and the MR job) run fine, with all connections going through 
> the dedicated proxy server. While this works, it's sub-optimal because it now 
> requires that such a server be created, which may not always be possible 
> because it requires privileged access to the gateway machine. 
> Therefore, it appears that Spark is perfectly happy running through a proxy 
> in YARN mode, but that it garbles the cluster-side configuration even when 
> properties are marked as {{final}}. I'm not sure if this is intended? Or is 
> there some other way that preserving the "final" properties can be enforced?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12747) Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12747:


Assignee: (was: Apache Spark)

> Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'
> --
>
> Key: SPARK-12747
> URL: https://issues.apache.org/jira/browse/SPARK-12747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Brandon Bradley
>  Labels: JDBC
>
> Hello,
> I'm getting this exception when trying to use DataFrame.jdbc.write on a 
> DataFrame with column ArrayType(DoubleType).
> {noformat}
> org.postgresql.util.PSQLException: Unable to find server array type for 
> provided name double precision
> {noformat}
> Driver is definitely on the driver and executor classpath as I have other 
> code that works without ArrayType. I'm not sure how to proceed in debugging.
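
For reference, a hedged reconstruction of the failing call path described above; the JDBC URL, table name and credentials are placeholders.

{code:java}
import java.util.Properties;
import org.apache.spark.sql.DataFrame;

// Sketch: writing a DataFrame that contains an array<double> column to Postgres
// over JDBC -- the operation that raises the PSQLException quoted above.
public class JdbcArrayWriteSketch {
  public static void write(DataFrame df) {
    Properties props = new Properties();
    props.setProperty("user", "postgres"); // placeholder credentials
    df.write().jdbc("jdbc:postgresql://localhost:5432/testdb", "doubles_table", props);
  }
}
{code}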



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12529) Spark streaming: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12529.
---
Resolution: Cannot Reproduce

> Spark streaming: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY
> ---
>
> Key: SPARK-12529
> URL: https://issues.apache.org/jira/browse/SPARK-12529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
> Environment: MacOSX Standalone
>Reporter: Brad Cox
>
> Posted originally on Stack Overflow. Reposted here at the request of Josh Rosen.
> I'm trying to start Spark Streaming in standalone mode (MacOSX) and getting 
> the following error no matter what:
> Exception in thread "main" java.lang.ExceptionInInitializerError at 
> org.apache.spark.storage.DiskBlockManager.addShutdownHook(DiskBlockManager.scala:147)
>  at org.apache.spark.storage.DiskBlockManager.(DiskBlockManager.scala:54) at 
> org.apache.spark.storage.BlockManager.(BlockManager.scala:75) at 
> org.apache.spark.storage.BlockManager.(BlockManager.scala:173) at 
> org.apache.spark.SparkEnv$.create(SparkEnv.scala:347) at 
> org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:194) at 
> org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277) at 
> org.apache.spark.SparkContext.(SparkContext.scala:450) at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:566)
>  at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:578)
>  at org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:90) 
> at 
> org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:78)
>  at io.ascolta.pcap.PcapOfflineReceiver.main(PcapOfflineReceiver.java:103) 
> Caused by: java.lang.NoSuchFieldException: SHUTDOWN_HOOK_PRIORITY at 
> java.lang.Class.getField(Class.java:1584) at 
> org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:220)
>  at 
> org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
>  at 
> org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
>  at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:189)
>  at org.apache.spark.util.ShutdownHookManager$.(ShutdownHookManager.scala:58) 
> at org.apache.spark.util.ShutdownHookManager$.(ShutdownHookManager.scala) ... 
> 13 more
> This symptom is discussed in relation to EC2 at 
> https://forums.databricks.com/questions/2227/shutdown-hook-priority-javalangnosuchfieldexceptio.html
>  as a Hadoop2 dependency. But I'm running locally (for now), and am using the 
> spark-1.5.2-bin-hadoop2.6.tgz binary from 
> https://spark.apache.org/downloads.html which I'd hoped would eliminate this 
> possibility.
> I've pruned my code down to essentially nothing; like this:
> SparkConf conf = new SparkConf()
>   .setAppName(appName)
>   .setMaster(master);
>   JavaStreamingContext ssc = new JavaStreamingContext(conf, new 
> Duration(1000));
> I've permuted maven dependencies to ensure all spark stuff is consistent at 
> version 1.5.2. Yet the ssc initialization above fails no matter what. So I 
> thought it was time to ask for help.
> Build environment is eclipse and maven with the shade plugin. Launch/run is 
> from eclipse debugger, not spark-submit, for now.
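
For what it's worth, a hedged sketch of the reflective lookup that appears to fail here, judging from the stack trace: Spark asks Hadoop's FileSystem class for its SHUTDOWN_HOOK_PRIORITY field, which exists only in Hadoop 2.x, so an older Hadoop class winning on the classpath would produce exactly this NoSuchFieldException. This is an assumption drawn from the trace, not a confirmed diagnosis.

{code:java}
import org.apache.hadoop.fs.FileSystem;

// Sketch (assumption based on the stack trace): the field lookup that throws
// NoSuchFieldException when a pre-2.x Hadoop FileSystem class is on the classpath.
public class ShutdownHookPriorityCheck {
  public static int lookup() throws NoSuchFieldException, IllegalAccessException {
    return (Integer) FileSystem.class.getField("SHUTDOWN_HOOK_PRIORITY").get(null);
  }
}
{code}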



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12747) Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091722#comment-15091722
 ] 

Apache Spark commented on SPARK-12747:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10695

> Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'
> --
>
> Key: SPARK-12747
> URL: https://issues.apache.org/jira/browse/SPARK-12747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Brandon Bradley
>  Labels: JDBC
>
> Hello,
> I'm getting this exception when trying to use DataFrame.jdbc.write on a 
> DataFrame with column ArrayType(DoubleType).
> {noformat}
> org.postgresql.util.PSQLException: Unable to find server array type for 
> provided name double precision
> {noformat}
> Driver is definitely on the driver and executor classpath as I have other 
> code that works without ArrayType. I'm not sure how to proceed in debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12747) Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12747:


Assignee: Apache Spark

> Postgres JDBC ArrayType(DoubleType) 'Unable to find server array type'
> --
>
> Key: SPARK-12747
> URL: https://issues.apache.org/jira/browse/SPARK-12747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Brandon Bradley
>Assignee: Apache Spark
>  Labels: JDBC
>
> Hello,
> I'm getting this exception when trying to use DataFrame.jdbc.write on a 
> DataFrame with column ArrayType(DoubleType).
> {noformat}
> org.postgresql.util.PSQLException: Unable to find server array type for 
> provided name double precision
> {noformat}
> Driver is definitely on the driver and executor classpath as I have other 
> code that works without ArrayType. I'm not sure how to proceed in debugging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5883) Add compression scheme in VertexAttributeBlock for shipping vertices to edge partitions

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5883.
--
Resolution: Won't Fix

> Add compression scheme in VertexAttributeBlock for shipping vertices to edge 
> partitions
> ---
>
> Key: SPARK-5883
> URL: https://issues.apache.org/jira/browse/SPARK-5883
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>
> The size of shipped data between vertex partitions and edge partitions
> is one of the major issues for better performance.
> SPARK-3649 indicated a ~10% performance gain in Pregel iterations
> by using custom serializers for ShuffledRDD.
> However, it is kind of tough to implement efficient serializers for ShuffledRDD
> inside GraphX because 1) how to use serializers in ShuffledRDD is different
> between SortShuffleManager and HashShuffleManager (see SPARK-3649)
> and 2) the type of 'VD' is unknown to GraphX.
> Therefore, I think that compressing shipped data inside GraphX
> (before it is passed into ShuffledRDD) is one of the better solutions for that.
> GraphX users register a user-defined serializer for VD, and then
> GraphX uses that serializer to compress shipped data between
> vertex partitions and edge partitions.
> My current patch applies this idea in ReplicatedVertexView#upgrade
> and ReplicatedVertexView#updateVertices.
> https://github.com/maropu/spark/commit/665b6c4a273b90e7c6e1545f982c7576a0e5ceb2
> Also, it can be applied into ReplicatedVertexView#withActiveSet
> and VertexRDDImpl#aggregateUsingIndex.
> I'm not sure whether this design is acceptable, so any advice is welcome.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5876) generalize the type of categoricalFeaturesInfo to PartialFunction[Int, Int]

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5876.
--
Resolution: Won't Fix

> generalize the type of categoricalFeaturesInfo to PartialFunction[Int, Int]
> ---
>
> Key: SPARK-5876
> URL: https://issues.apache.org/jira/browse/SPARK-5876
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Erik Erlandson
>Priority: Minor
>  Labels: easyfix
>
> The decision tree training takes a parameter {{categoricalFeaturesInfo}} of 
> type {{Map\[Int,Int\]}} that encodes information about any features that are 
> categories and how many categorical values are present.
> It would be useful to generalize this type to its superclass 
> {{PartialFunction\[Int,Int\]}}, which would be backward compatible with 
> {{Map\[Int,Int\]}}, but can also accept a {{Seq\[Int\]}}, or any other 
> partial function.
> Would need to verify that any tests for key definition in the mapping are 
> using {{isDefinedAt(key)}} instead of {{contains(key)}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6068) KMeans Parallel test may fail

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6068.
--
Resolution: Won't Fix

> KMeans Parallel test may fail
> -
>
> Key: SPARK-6068
> URL: https://issues.apache.org/jira/browse/SPARK-6068
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Tests
>Affects Versions: 1.2.1
>Reporter: Derrick Burns
>Priority: Minor
>  Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will 
> add at least one new cluster center.  The current implementation of K-Means 
> || adds 2*k cluster centers with high probability.  However, there is no 
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1)  change the KMeans || implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that KMeans || should provide 
> a precise number of cluster centers when possible. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3878) Benchmarks and common tests for mllib algorithm

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3878.
--
Resolution: Won't Fix

> Benchmarks and common tests for mllib algorithm
> ---
>
> Key: SPARK-3878
> URL: https://issues.apache.org/jira/browse/SPARK-3878
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Egor Pakhomov
>Assignee: Egor Pakhomov
>
> There is no common practice among MLlib for testing algorithms: every model 
> generates its own random test data. There are no easily extractable test cases 
> applicable to another algorithm. There are no benchmarks for comparing 
> algorithms. After implementing a new algorithm it's very hard to understand how 
> it should be tested. 
> Lack of serialization testing: MLlib algorithms don't contain tests which 
> verify that a model works after serialization. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6425) Add parallel Q-learning algorithm to MLLib

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6425.
--
Resolution: Won't Fix

> Add parallel Q-learning algorithm to MLLib
> --
>
> Key: SPARK-6425
> URL: https://issues.apache.org/jira/browse/SPARK-6425
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: zhangyouhua
>
> [~mengxr]
> Q-learning is a model-free reinforcement learning technique. Specifically, 
> Q-learning can be used to find an optimal action-selection policy for any 
> given (finite) Markov decision process (MDP). It works by learning an 
> action-value function that ultimately gives the expected utility of taking a 
> given action in a given state. One of the strengths of Q-learning is that it 
> is able to compare the expected utility of the available actions without 
> requiring a model of the environment. Additionally, Q-learning can handle 
> problems with stochastic transitions and rewards, without requiring any 
> adaptations.
> It can be used in artificial intelligence.
> We will use MapReduce for RL with Linear Function Approximation to 
> implement it. Some details can be found at 
> [https://ewrl.files.wordpress.com/2011/08/ewrl2011_submission_11.pdf]
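
For readers unfamiliar with the update rule mentioned above, a hedged single-machine sketch in Scala (tabular form only, with hypothetical integer state/action ids; not the proposed parallel/MapReduce implementation): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max over a' of Q(s',a') - Q(s,a)).

{code}
import scala.collection.mutable

// One tabular Q-learning update for the transition (s, a) -> (reward, sNext).
def qUpdate(q: mutable.Map[(Int, Int), Double],
            s: Int, a: Int, reward: Double, sNext: Int, actions: Seq[Int],
            alpha: Double = 0.1, gamma: Double = 0.9): Unit = {
  val bestNext = actions.map(aN => q.getOrElse((sNext, aN), 0.0)).max
  val old = q.getOrElse((s, a), 0.0)
  q((s, a)) = old + alpha * (reward + gamma * bestNext - old)
}
{code}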



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private

2016-01-11 Thread Wojciech Jurczyk (JIRA)
Wojciech Jurczyk created SPARK-12751:


 Summary: Traits generated by SharedParamsCodeGen should not be 
private
 Key: SPARK-12751
 URL: https://issues.apache.org/jira/browse/SPARK-12751
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.6.0, 1.5.2
Reporter: Wojciech Jurczyk


Many Estimators and Transformers mix in traits generated by 
SharedParamsCodeGen. These estimators and transformers (like StringIndexer, 
MinMaxScaler etc) are accessible publicly while traits generated by 
SharedParamsCodeGen are private\[ml\]. From user's code it is possible to 
invoke methods that the traits introduce but it is illegal to use any trait 
explicitly. For example, you can call setInputCol(str) on StringIndexer but you 
are not allowed to assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
For example, it is impossible to create a collection of transformers that have 
both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with 
HasOutputCol\]). We have to use structural typing and reflective calls like 
this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}

This seems easy to fix, exposing a couple of traits should not break anything. 
On the other hand, maybe it goes deeper than that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private

2016-01-11 Thread Wojciech Jurczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wojciech Jurczyk updated SPARK-12751:
-
Priority: Minor  (was: Major)

> Traits generated by SharedParamsCodeGen should not be private
> -
>
> Key: SPARK-12751
> URL: https://issues.apache.org/jira/browse/SPARK-12751
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Wojciech Jurczyk
>Priority: Minor
>
> Many Estimators and Transformers mix in traits generated by 
> SharedParamsCodeGen. These estimators and transformers (like StringIndexer, 
> MinMaxScaler etc) are accessible publicly while traits generated by 
> SharedParamsCodeGen are private\[ml\]. From user's code it is possible to 
> invoke methods that the traits introduce but it is illegal to use any trait 
> explicitly. For example, you can call setInputCol(str) on StringIndexer but 
> you are not allowed to assign StringIndexer to a variable of type HasInputCol.
> {code:java}
> val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
> {code}
> For example, it is impossible to create a collection of transformers that 
> have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with 
> HasInputCol with HasOutputCol\]). We have to use structural typing and 
> reflective calls like this:
> {code}
> ml.Estimator[_] { val outputCol: ml.param.Param[String] }
> {code}
> This seems easy to fix, exposing a couple of traits should not break 
> anything. On the other hand, maybe it goes deeper than that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6589) SQLUserDefinedType failed in spark-shell

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6589.
--
Resolution: Not A Problem

I think this is, effectively, "not a problem" in the sense that this is just 
how the shell works. It necessarily puts classes into its own classloader, 
which is a child of Spark's, so Spark's classes can't see yours; to make this 
work you would have to supply your classes to Spark itself. As written, this 
basically won't work.
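
A hedged illustration of the mismatch (standard JDK API only; {{com.bwang.MyTestUDT}} is the reporter's class): {{Class.forName(name)}} resolves against the caller's defining classloader, while the three-argument overload lets you pick a loader explicitly.

{code}
// Resolves against the caller's classloader; when the caller is a Spark class
// loaded by the application classloader, REPL-defined classes are not visible.
val viaCaller = Class.forName("com.bwang.MyTestUDT")

// Resolves against the thread context classloader, which in the shell is
// typically the REPL's TranslatingClassLoader and can see shell-defined classes.
val viaContext = Class.forName(
  "com.bwang.MyTestUDT", true, Thread.currentThread().getContextClassLoader)
{code}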

> SQLUserDefinedType failed in spark-shell
> 
>
> Key: SPARK-6589
> URL: https://issues.apache.org/jira/browse/SPARK-6589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: CDH 5.3.2
>Reporter: Benyi Wang
>
> {{DataType.fromJson}} will fail in spark-shell if the schema includes "udt". 
> It works if running in an application. 
> This means I cannot read a Parquet file that includes a UDT field. 
> {{DataType.fromCaseClass}} does not support UDT.
> I can load the class which shows that my UDT is in the classpath.
> {code}
> scala> Class.forName("com.bwang.MyTestUDT")
> res6: Class[_] = class com.bwang.MyTestUDT
> {code}
> But DataType fails:
> {code}
> scala> DataType.fromJson(json)
>   
> java.lang.ClassNotFoundException: com.bwang.MyTestUDT
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:190)
> at 
> org.apache.spark.sql.catalyst.types.DataType$.parseDataType(dataTypes.scala:77)
> {code}
> The reason is DataType.fromJson tries to load {{udtClass}} using this code:
> {code}
> case JSortedObject(
> ("class", JString(udtClass)),
> ("pyClass", _),
> ("sqlType", _),
> ("type", JString("udt"))) =>
>   Class.forName(udtClass).newInstance().asInstanceOf[UserDefinedType[_]]
>   }
> {code}
> Unfortunately, my UDT is loaded by {{SparkIMain$TranslatingClassLoader}}, but 
> DataType is loaded by {{Launcher$AppClassLoader}}.
> {code}
> scala> DataType.getClass.getClassLoader
> res2: ClassLoader = sun.misc.Launcher$AppClassLoader@6876fb1b
> scala> this.getClass.getClassLoader
> res3: ClassLoader = 
> org.apache.spark.repl.SparkIMain$TranslatingClassLoader@63d36b29
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6401) Unable to load a old API input format in Spark streaming

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6401.
--
Resolution: Won't Fix

Although the old APIs still exist in newer Hadoop versions, given that we're 
moving away from Hadoop 1 and early Hadoop 2, it may be time to say this won't 
be added.

> Unable to load a old API input format in Spark streaming
> 
>
> Key: SPARK-6401
> URL: https://issues.apache.org/jira/browse/SPARK-6401
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Rémy DUBOIS
>Priority: Minor
>
> The fileStream method of the JavaStreamingContext class does not allow using 
> an old-API InputFormat.
> This feature exists in Spark batch but not in streaming.
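
A hedged sketch of the asymmetry being described (Scala API; it assumes {{sc: SparkContext}} and {{ssc: StreamingContext}} are in scope and the paths are placeholders):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{TextInputFormat => OldTextInputFormat}              // old API
import org.apache.hadoop.mapreduce.lib.input.{TextInputFormat => NewTextInputFormat} // new API

// The batch API accepts an old-API (org.apache.hadoop.mapred) InputFormat...
val rdd = sc.hadoopFile[LongWritable, Text, OldTextInputFormat]("hdfs:///input")
// ...while fileStream is bound to the new API, which is the gap reported here.
val stream = ssc.fileStream[LongWritable, Text, NewTextInputFormat]("hdfs:///input")
{code}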



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6282) Strange Python import error when using random() in a lambda function

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6282.
--
Resolution: Not A Problem

Seems like some issue with Python libraries

> Strange Python import error when using random() in a lambda function
> 
>
> Key: SPARK-6282
> URL: https://issues.apache.org/jira/browse/SPARK-6282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, Python 2.7.6
>Reporter: Pavel Laskov
>Priority: Minor
>
> Consider the exemplary Python code below:
>from random import random
>from pyspark.context import SparkContext
>from xval_mllib import read_csv_file_as_list
> if __name__ == "__main__": 
> sc = SparkContext(appName="Random() bug test")
> data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
> #data = sc.parallelize([1, 2, 3, 4, 5], 2)
> d = data.map(lambda x: (random(), x))
> print d.first()
> Data is read from a large CSV file. Running this code results in a Python 
> import error:
> ImportError: No module named _winreg
> If I use 'import random' and 'random.random()' in the lambda function no 
> error occurs. Also no error occurs, for both kinds of import statements, for 
> a small artificial data set like the one shown in a commented line.  
> The full error trace, the source code of csv reading code (function 
> 'read_csv_file_as_list' is my own) as well as a sample dataset (the original 
> dataset is about 8M large) can be provided. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6616) IsStopped set to true before stop() is complete.

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6616.
--
Resolution: Not A Problem

I think this was obsoleted by subsequent changes that made this more 
thread-safe.

> IsStopped set to true before stop() is complete.
> ---
>
> Key: SPARK-6616
> URL: https://issues.apache.org/jira/browse/SPARK-6616
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Ilya Ganelin
>
> There are numerous instances throughout the code base of the following:
> {code}
> if (!stopped) {
> stopped = true
> ...
> }
> {code}
> In general, this is bad practice since it can cause an incomplete cleanup if 
> there is an error during shutdown and not all code executes. Incomplete 
> cleanup is harder to track down than a double cleanup that triggers some 
> error. I propose fixing this throughout the code, starting with the cleanup 
> sequence with {code}SparkContext.stop() {code}
> A cursory examination reveals this in {code}SparkContext.stop(), 
> SparkEnv.stop(), and ContextCleaner.stop() {code}
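
A hedged sketch of one safer shape (illustrative only, not the code Spark ultimately adopted): flip the flag atomically, but roll it back if cleanup fails so the cleanup can be retried rather than silently skipped.

{code}
import java.util.concurrent.atomic.AtomicBoolean

class Stoppable {
  private val stopped = new AtomicBoolean(false)

  def stop(): Unit = {
    // Only the first caller performs cleanup.
    if (stopped.compareAndSet(false, true)) {
      try {
        // ... release resources here ...
      } catch {
        case e: Throwable =>
          stopped.set(false)  // cleanup did not complete; allow a retry
          throw e
      }
    }
  }
}
{code}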



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-975) Spark Replay Debugger

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-975.
-
Resolution: Won't Fix

> Spark Replay Debugger
> -
>
> Key: SPARK-975
> URL: https://issues.apache.org/jira/browse/SPARK-975
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: liancheng
>  Labels: arthur, debugger
> Attachments: IMG_20140722_184149.jpg, RDD DAG.png
>
>
> The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
> report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
> [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
> Dave|https://github.com/ankurdave], is an old implementation of the Spark 
> debugger, which demonstrated both the elegance and power behind the RDD 
> abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
> into the master branch, and development stopped 2 years ago.  For more information 
> about Arthur, please refer to [the Spark Debugger Wiki 
> page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
> repository.
> As a useful tool for Spark application debugging and analysis, it would be 
> nice to have a complete Spark debugger.  In 
> [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
> implementation of the Spark debugger, the Spark Replay Debugger (SRD).
> [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
> for discussion.  In the current version, I only implemented features that can 
> illustrate the basic mechanisms.  There are still features that appeared in 
> Arthur but are missing in SRD, such as checksum-based nondeterminism detection 
> and single-task debugging with a conventional debugger (like {{jdb}}).  However, 
> these features can easily be built upon the current SRD framework.  To minimize 
> code review effort, I intentionally didn't include them in the current version.
> Attached is the visualization of the MLlib ALS application (with 1 iteration) 
> generated by SRD.  For more information, please refer to [the SRD overview 
> document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12751) Traits generated by SharedParamsCodeGen should not be private

2016-01-11 Thread Wojciech Jurczyk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wojciech Jurczyk updated SPARK-12751:
-
Description: 
Many Estimators and Transformers mix in traits generated by 
[SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala].
 These estimators and transformers (like StringIndexer, MinMaxScaler etc) are 
accessible publicly while traits generated by SharedParamsCodeGen are 
private\[ml\]. From user's code it is possible to invoke methods that the 
traits introduce but it is illegal to use any trait explicitly. For example, 
you can call setInputCol(str) on StringIndexer but you are not allowed to 
assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
For example, it is impossible to create a collection of transformers that have 
both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with 
HasOutputCol\]). We have to use structural typing and reflective calls like 
this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}

This seems easy to fix, exposing a couple of traits should not break anything. 
On the other hand, maybe it goes deeper than that.

  was:
Many Estimators and Transformers mix in traits generated by 
SharedParamsCodeGen. These estimators and transformers (like StringIndexer, 
MinMaxScaler etc) are accessible publicly while traits generated by 
SharedParamsCodeGen are private\[ml\]. From user's code it is possible to 
invoke methods that the traits introduce but it is illegal to use any trait 
explicitly. For example, you can call setInputCol(str) on StringIndexer but you 
are not allowed to assign StringIndexer to a variable of type HasInputCol.
{code:java}
val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
{code}
For example, it is impossible to create a collection of transformers that have 
both HasInputCol and HasOutputCol (e.g. Set\[Transformer with HasInputCol with 
HasOutputCol\]). We have to use structural typing and reflective calls like 
this:
{code}
ml.Estimator[_] { val outputCol: ml.param.Param[String] }
{code}

This seems easy to fix, exposing a couple of traits should not break anything. 
On the other hand, maybe it goes deeper than that.


> Traits generated by SharedParamsCodeGen should not be private
> -
>
> Key: SPARK-12751
> URL: https://issues.apache.org/jira/browse/SPARK-12751
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Wojciech Jurczyk
>Priority: Minor
>
> Many Estimators and Transformers mix in traits generated by 
> [SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala].
>  These estimators and transformers (like StringIndexer, MinMaxScaler etc) are 
> accessible publicly while traits generated by SharedParamsCodeGen are 
> private\[ml\]. From user's code it is possible to invoke methods that the 
> traits introduce but it is illegal to use any trait explicitly. For example, 
> you can call setInputCol(str) on StringIndexer but you are not allowed to 
> assign StringIndexer to a variable of type HasInputCol.
> {code:java}
> val x: HasInputCol = new StringIndexer() // Usage of HasInputCol is illegal.
> {code}
> For example, it is impossible to create a collection of transformers that 
> have both HasInputCol and HasOutputCol (e.g. Set\[Transformer with 
> HasInputCol with HasOutputCol\]). We have to use structural typing and 
> reflective calls like this:
> {code}
> ml.Estimator[_] { val outputCol: ml.param.Param[String] }
> {code}
> This seems easy to fix, exposing a couple of traits should not break 
> anything. On the other hand, maybe it goes deeper than that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7615) MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091762#comment-15091762
 ] 

Apache Spark commented on SPARK-7615:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10696

> MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero 
> 
>
> Key: SPARK-7615
> URL: https://issues.apache.org/jira/browse/SPARK-7615
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Eric Li
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In Word2VecModel, wordVecNorms may contain Euclidean norms equal to zero. 
> This causes an incorrect cosine-distance calculation when you do 
> cosineVec(ind) / wordVecNorms(ind). The cosine distance should be equal to 0 
> when the norm is 0. 
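
A hedged sketch of the guard being requested (standalone Scala; {{safeCosine}} is a hypothetical helper, not the MLlib method itself): define the similarity as 0 whenever a norm is 0 instead of dividing by it.

{code}
// Returns 0 when either norm is zero, avoiding division by zero and NaN results.
def safeCosine(dot: Double, normA: Double, normB: Double): Double =
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
{code}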



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it

2016-01-11 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091792#comment-15091792
 ] 

Adrian Bridgett commented on SPARK-12622:
-

Hi Ajesh - I'm not sure what more I can add that's not already here:

- jar file in /tmp/ - e.g. /tmp/f oo.jar
- spark-submit --class Foo "/tmp/f oo.jar" fails on executors (no such class)
- mv "/tmp/f oo.jar" /tmp/foo.jar
- spark-submit --class Foo "/tmp/foo.jar"  works

spark-defaults.conf contains (amongst tuning lines):
spark.master 
mesos://zk://mesos-1.example.net:2181,mesos-2.example.net:2181,mesos-3.example.net:2181/mesos

> spark-submit fails on executors when jar has a space in it
> --
>
> Key: SPARK-12622
> URL: https://issues.apache.org/jira/browse/SPARK-12622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
> Environment: Linux, Mesos 
>Reporter: Adrian Bridgett
>Priority: Minor
>
> spark-submit --class foo "Foo.jar"  works
> but when using "f oo.jar" it starts to run and then breaks on the executors 
> as they cannot find the various functions.
> Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this 
> fails immediately.
> {noformat}
> spark-submit --class Foo /tmp/f\ oo.jar
> ...
> spark.jars=file:/tmp/f%20oo.jar
> 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at 
> http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769
> 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 
> (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: 
> Foo$$anonfun$46
> {noformat}
> SPARK-6568 is related but maybe specific to the Windows environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join

2016-01-11 Thread Kaiyuan Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091864#comment-15091864
 ] 

Kaiyuan Yang commented on SPARK-12066:
--

Dirty data can trigger a lot of “strange” exceptions; we should correct the 
data.

> spark sql  throw java.lang.ArrayIndexOutOfBoundsException when use table.* 
> with join 
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.2
> Environment: linux 
>Reporter: Ricky Yang
>Priority: Blocker
>
> java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following 
> Spark SQL on Spark standalone or YARN.
> The SQL:
> select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> But the result is correct when not using *, as follows:
> select ta.sale_dt 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
> error log :
>   
> 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ] 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException 
> Driver stacktrace: 
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at scala.Option.foreach(Option.scala:236) 
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>  
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) 
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215) 
> at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207) 
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
>  
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) 
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) 
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409) 
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425) 
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apa

[jira] [Created] (SPARK-12752) Can Thrift Server connect to Hive Metastore?

2016-01-11 Thread Tao Wang (JIRA)
Tao Wang created SPARK-12752:


 Summary: Can Thrift Server connect to Hive Metastore?
 Key: SPARK-12752
 URL: https://issues.apache.org/jira/browse/SPARK-12752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Tao Wang


Now Thrift Server can directly connect to a database such as MySQL to store its 
metadata. Technically it should be fine to connect to the Hive Metastore as well. 

But when we try to do so by setting `hive.metastore.uris` to the thrift URL of 
the Hive Metastore, we find an exception message in the log while executing 
"create table t1 (name string)".

It is in secure mode and the newly added configuration is like: 

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
</property>
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/hadoop.hadoop@hadoop.com</value>
</property>


Later I will attach the log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join

2016-01-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12066:
--
Priority: Major  (was: Blocker)

[~ourui521314] don't set blocker; this sounds like a data problem

> spark sql  throw java.lang.ArrayIndexOutOfBoundsException when use table.* 
> with join 
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.2
> Environment: linux 
>Reporter: Ricky Yang
>
> java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following 
> Spark SQL on Spark standalone or YARN.
> The SQL:
> select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> But the result is correct when not using *, as follows:
> select ta.sale_dt 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
> error log :
>   
> 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ] 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, namenode2-sit.cnsuning.com): java.lang.ArrayIndexOutOfBoundsException 
> Driver stacktrace: 
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>  
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) 
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>  
> at scala.Option.foreach(Option.scala:236) 
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>  
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>  
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) 
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850) 
> at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:215) 
> at 
> org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207) 
> at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:587)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:308)
>  
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) 
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) 
> at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:409) 
> at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:425) 
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>  
> at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>  
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:606) 
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(Sp

[jira] [Commented] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join

2016-01-11 Thread Ricky Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091934#comment-15091934
 ] 

Ricky Yang commented on SPARK-12066:


Yes, it's a data problem, so I added a try/catch around the following code:
  at 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryUtils.byteArrayToLong(LazyBinaryUtils.java:81)
 

The exception is:
15/12/03 15:53:43 INFO hive.HadoopTableReader: 
mutableRow.getString(0),mutableRow.getString(1),mutableRow.getString(2)
15/12/03 15:53:43 INFO hive.HadoopTableReader: 173732,201405,20130104
15/12/03 15:53:43 INFO hive.HadoopTableReader:  exception 
fieldRefs(i):39:sale_cnt
java.lang.ArrayIndexOutOfBoundsException: 9731
at 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryUtils.byteArrayToLong(LazyBinaryUtils.java:78)
at 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryDouble.init(LazyBinaryDouble.java:43)
at 
org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:111)
at 
org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:172)
at 
org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:67)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:390)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:381)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)


Should the original Spark SQL code catch the exception and set this value to null?
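
For what it's worth, a hedged sketch of the kind of defensive handling the question refers to (illustrative Scala only; not what Spark's TableReader actually does, and {{readField}} is a hypothetical by-name parameter standing in for the SerDe field extraction):

{code}
// Fall back to null when a corrupt field cannot be decoded.
def safeField(readField: => Any): Any =
  try readField
  catch { case _: ArrayIndexOutOfBoundsException => null }
{code}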


> spark sql  throw java.lang.ArrayIndexOutOfBoundsException when use table.* 
> with join 
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.2
> Environment: linux 
>Reporter: Ricky Yang
>
> java.lang.ArrayIndexOutOfBoundsException is thrown when I use the following 
> Spark SQL on Spark standalone or YARN.
> The SQL:
> select ta.* 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale_dt = '20140514' 
> and ta.sale_price >= tb.pri_from 
> and ta.sale_price < tb.pri_to limit 10 ; 
> But the result is correct when not using *, as follows:
> select ta.sale_dt 
> from bi_td.dm_price_seg_td tb 
> join bi_sor.sor_ord_detail_tf ta 
> on 1 = 1 
> where ta.sale

[jira] [Updated] (SPARK-12752) Can Thrift Server connect to Hive Metastore?

2016-01-11 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-12752:
-
Attachment: JDBCServer.log

> Can Thrift Server connect to Hive Metastore?
> 
>
> Key: SPARK-12752
> URL: https://issues.apache.org/jira/browse/SPARK-12752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Tao Wang
> Attachments: JDBCServer.log
>
>
> Now Thrift Server can directly connect to a database such as MySQL to store 
> its metadata. Technically it should be fine to connect to the Hive Metastore 
> as well. 
> But when we try to do so by setting `hive.metastore.uris` to the thrift URL 
> of the Hive Metastore, we find an exception message in the log while 
> executing "create table t1 (name string)".
> It is in secure mode and the newly added configuration is like: 
> 
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
> </property>
> <property>
>   <name>hive.metastore.sasl.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.principal</name>
>   <value>hive/hadoop.hadoop@hadoop.com</value>
> </property>
> 
> Later I will attach the log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091941#comment-15091941
 ] 

Thomas Graves commented on SPARK-2930:
--

I think we should still document this.  It's a one-line change; I'll try to get 
something up today.

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-11 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reopened SPARK-2930:
--

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12752) Can Thrift Server connect to Hive Metastore?

2016-01-11 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-12752:
-
Description: 
Previously we used Thrift Server connecting directly to a database such as MySQL 
to store its metadata. Now we want to read data stored by Hive, so we try to 
connect Thrift Server to the Hive Metastore.

In non-secure mode it works for me by setting `hive.metastore.uris` to the 
thrift URL of the Hive Metastore. But when testing it in a secure cluster we 
hit some Kerberos-related problems.

The error message is shown in the attached log. The SQL statement is "create 
table t1 (name string)", which is handled by HiveQl.

The newly added configuration for security is like: 

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
</property>
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/hadoop.hadoop@hadoop.com</value>
</property>


Later I will attach the log.

  was:
Now Thrift Server can directly connect to a database such as MySQL to store its 
metadata. Technically it should be fine to connect to the Hive Metastore as well. 

But when we try to do so by setting `hive.metastore.uris` to the thrift URL of 
the Hive Metastore, we find an exception message in the log while executing 
"create table t1 (name string)".

It is in secure mode and the newly added configuration is like: 

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
</property>
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/hadoop.hadoop@hadoop.com</value>
</property>


Later I will attach the log.


> Can Thrift Server connect to Hive Metastore?
> 
>
> Key: SPARK-12752
> URL: https://issues.apache.org/jira/browse/SPARK-12752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Tao Wang
> Attachments: JDBCServer.log
>
>
> Previously we used Thrift Server connecting directly to a database such as 
> MySQL to store its metadata. Now we want to read data stored by Hive, so we 
> try to connect Thrift Server to the Hive Metastore.
> In non-secure mode it works for me by setting `hive.metastore.uris` to the 
> thrift URL of the Hive Metastore. But when testing it in a secure cluster we 
> hit some Kerberos-related problems.
> The error message is shown in the attached log. The SQL statement is "create 
> table t1 (name string)", which is handled by HiveQl.
> The newly added configuration for security is like: 
> 
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
> </property>
> <property>
>   <name>hive.metastore.sasl.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.principal</name>
>   <value>hive/hadoop.hadoop@hadoop.com</value>
> </property>
> 
> Later I will attach the log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12752) Can Thrift Server connect to Hive Metastore?

2016-01-11 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-12752:
-
Description: 
Previously we used Thrift Server connecting directly to a database such as MySQL 
to store its metadata. Now we want to read data stored by Hive, so we try to 
connect Thrift Server to the Hive Metastore.

In non-secure mode it works for me by setting `hive.metastore.uris` to the 
thrift URL of the Hive Metastore. But when testing it in a secure cluster we 
hit some Kerberos-related problems.

The error message is shown in the attached log. The SQL statement is "create 
table t1 (name string)", which is handled by HiveQl.

The newly added configuration for security is like: 

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
</property>
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/hadoop.hadoop@hadoop.com</value>
</property>


I don't understand too much about Hive, but technically I think it should work 
in this mode. Could anyone with experience give some advice? Thanks :)

  was:
Previously we used Thrift Server connecting directly to a database such as MySQL 
to store its metadata. Now we want to read data stored by Hive, so we try to 
connect Thrift Server to the Hive Metastore.

In non-secure mode it works for me by setting `hive.metastore.uris` to the 
thrift URL of the Hive Metastore. But when testing it in a secure cluster we 
hit some Kerberos-related problems.

The error message is shown in the attached log. The SQL statement is "create 
table t1 (name string)", which is handled by HiveQl.

The newly added configuration for security is like: 

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
</property>
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive/hadoop.hadoop@hadoop.com</value>
</property>


Later I will attach the log.


> Can Thrift Server connect to Hive Metastore?
> 
>
> Key: SPARK-12752
> URL: https://issues.apache.org/jira/browse/SPARK-12752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Tao Wang
> Attachments: JDBCServer.log
>
>
> Previously we used Thrift Server connecting directly to a database such as 
> MySQL to store its metadata. Now we want to read data stored by Hive, so we 
> try to connect Thrift Server to the Hive Metastore.
> In non-secure mode it works for me by setting `hive.metastore.uris` to the 
> thrift URL of the Hive Metastore. But when testing it in a secure cluster we 
> hit some Kerberos-related problems.
> The error message is shown in the attached log. The SQL statement is "create 
> table t1 (name string)", which is handled by HiveQl.
> The newly added configuration for security is like: 
> 
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://9.96.1.116:21088,thrift://9.96.1.115:21088,thrift://9.96.1.114:21088</value>
> </property>
> <property>
>   <name>hive.metastore.sasl.enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.kerberos.principal</name>
>   <value>hive/hadoop.hadoop@hadoop.com</value>
> </property>
> 
> I don't understand too much about Hive, but technically I think it should 
> work in this mode. Could anyone with experience give some advice? Thanks :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091965#comment-15091965
 ] 

Thomas Graves commented on SPARK-2930:
--

I think simply putting a webhdfs url in the examples should be good here.
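
For illustration, the sort of example being proposed, written against the Scala API (hostnames and ports are placeholders, not values from the actual docs):

{code}
import org.apache.spark.SparkConf

// spark.yarn.access.namenodes takes a comma-separated list; a webhdfs:// URI
// can be listed alongside hdfs:// ones.
val conf = new SparkConf().set(
  "spark.yarn.access.namenodes",
  "hdfs://nn1.example.com:8020,webhdfs://nn2.example.com:50070")
{code}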

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2930:
---

Assignee: Apache Spark  (was: Thomas Graves)

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2930:
---

Assignee: Thomas Graves  (was: Apache Spark)

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2930) clarify docs on using webhdfs with spark.yarn.access.namenodes

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091969#comment-15091969
 ] 

Apache Spark commented on SPARK-2930:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/10699

> clarify docs on using webhdfs with spark.yarn.access.namenodes
> --
>
> Key: SPARK-2930
> URL: https://issues.apache.org/jira/browse/SPARK-2930
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Minor
>
> The documentation of spark.yarn.access.namenodes talks about putting 
> namenodes in it and gives an example with hdfs://.  
> It can also be used with webhdfs, so we should clarify how to use it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)
Dat Tran created SPARK-12753:


 Summary: Import error during unit test while calling a function 
from reduceByKey()
 Key: SPARK-12753
 URL: https://issues.apache.org/jira/browse/SPARK-12753
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 1.6.0
 Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
Anaconda 
Reporter: Dat Tran
Priority: Trivial


The current directory structure for my test script is as follows:
project/
  script/
 __init__.py 
 map.py
  test/
__init.py__
test_map.py

I have attached map.py and test_map.py file with this issue. 

When I run the nosetest in the test directory, the test fails. I get no module 
named "script" found error. 
However when I modify the map_add function to replace the call to add within 
reduceByKey in map.py like this:

def map_add(df):
result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: x+y)
return result

The test passes.

Also, when I run the original test_map.py from the project directory, the test 
passes. 

I am not able to figure out why the test doesn't detect the script module when 
it is within the test directory. 

I have also attached the log error file. Any help will be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dat Tran updated SPARK-12753:
-
Attachment: map.py

> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>   script/
>  __init__.py 
>  map.py
>   test/
> __init.py__
> test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run the nosetest in the test directory, the test fails. I get no 
> module named "script" found error. 
> However when I modify the map_add function to replace the call to add within 
> reduceByKey in map.py like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dat Tran updated SPARK-12753:
-
Attachment: log.txt
test_map.py

> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: log.txt, map.py, test_map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>   script/
>  __init__.py 
>  map.py
>   test/
> __init.py__
> test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run the nosetest in the test directory, the test fails. I get no 
> module named "script" found error. 
> However when I modify the map_add function to replace the call to add within 
> reduceByKey in map.py like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dat Tran updated SPARK-12753:
-
Description: 
The current directory structure for my test script is as follows:
project/
   script/
  __init__.py 
  map.py
   test/
 __init.py__
 test_map.py

I have attached map.py and test_map.py file with this issue. 

When I run nosetests in the test directory, the test fails with a "no module 
named 'script'" error. 
However, when I modify the map_add function in map.py to replace the call to add 
within reduceByKey, like this:

def map_add(df):
    result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x, y: x + y)
    return result

The test passes.

Also, when I run the original test_map.py from the project directory, the test 
passes. 

I am not able to figure out why the test doesn't detect the script module when 
it is within the test directory. 

I have also attached the log error file. Any help will be much appreciated.

  was:
The current directory structure for my test script is as follows:
project/
  script/
 __init__.py 
 map.py
  test/
__init.py__
test_map.py

I have attached map.py and test_map.py file with this issue. 

When I run the nosetest in the test directory, the test fails. I get no module 
named "script" found error. 
However when I modify the map_add function to replace the call to add within 
reduceByKey in map.py like this:

def map_add(df):
result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: x+y)
return result

The test passes.

Also, when I run the original test_map.py from the project directory, the test 
passes. 

I am not able to figure out why the test doesn't detect the script module when 
it is within the test directory. 

I have also attached the log error file. Any help will be much appreciated.


> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: log.txt, map.py, test_map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>script/
>   __init__.py 
>   map.py
>test/
>  __init.py__
>  test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run nosetests in the test directory, the test fails with a "no module 
> named 'script'" error. 
> However, when I modify the map_add function in map.py to replace the call to add 
> within reduceByKey, like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12754) Data type mismatch on two array values when using filter/where

2016-01-11 Thread Jesse English (JIRA)
Jesse English created SPARK-12754:
-

 Summary: Data type mismatch on two array values when using 
filter/where
 Key: SPARK-12754
 URL: https://issues.apache.org/jira/browse/SPARK-12754
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 1.5.0
 Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.5.0+
Reporter: Jesse English


The following test produces the error _org.apache.spark.sql.AnalysisException: 
cannot resolve '(point = array(0,9))' due to data type mismatch: differing 
types in '(point = array(0,9))' (array and array)_

This is not the case on 1.4.x, but has been introduced with 1.5+.  Is there a 
preferred method for making this sort of arbitrarily sized array comparison?

{code:title=test.scala}
test("test array comparison") {

val vectors: Vector[Row] =  Vector(
  Row.fromTuple("id_1" -> Array(0L, 2L)),
  Row.fromTuple("id_2" -> Array(0L, 5L)),
  Row.fromTuple("id_3" -> Array(0L, 9L)),
  Row.fromTuple("id_4" -> Array(1L, 0L)),
  Row.fromTuple("id_5" -> Array(1L, 8L)),
  Row.fromTuple("id_6" -> Array(2L, 4L)),
  Row.fromTuple("id_7" -> Array(5L, 6L)),
  Row.fromTuple("id_8" -> Array(6L, 2L)),
  Row.fromTuple("id_9" -> Array(7L, 0L))
)
val data: RDD[Row] = sc.parallelize(vectors, 3)

val schema = StructType(
  StructField("id", StringType, false) ::
StructField("point", DataTypes.createArrayType(LongType), false) ::
Nil
)

val sqlContext = new SQLContext(sc)
var dataframe = sqlContext.createDataFrame(data, schema)

val  targetPoint:Array[Long] = Array(0L,9L)

//This is the line where it fails
//org.apache.spark.sql.AnalysisException: cannot resolve 
// '(point = array(0,9))' due to data type mismatch:
// differing types in '(point = array(0,9))' 
// (array and array).

val targetRow = dataframe.where(dataframe("point") === 
array(targetPoint.map(value => lit(value)): _*)).first()

assert(targetRow != null)
  }
{code}
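
For illustration (not part of the original report): one possible workaround, until 
comparing an array column against a literal array is better supported, is to do the 
element-wise check inside a UDF instead of using {{===}}. This is only a sketch and 
assumes the array column arrives in the UDF as a {{Seq[Long]}}, which is how Spark 
passes {{array}} values to Scala UDFs.

{code:title=workaround.scala}
import org.apache.spark.sql.functions.udf

// Compare the array column against the target sequence inside a UDF,
// sidestepping the analyzer's type check on `point === array(...)`.
val target: Seq[Long] = Seq(0L, 9L)
val matchesTarget = udf((point: Seq[Long]) => point == target)

val targetRow = dataframe.filter(matchesTarget(dataframe("point"))).first()
{code}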



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12754) Data type mismatch on two array values when using filter/where

2016-01-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092268#comment-15092268
 ] 

kevin yu commented on SPARK-12754:
--

I will look into this. 

> Data type mismatch on two array values when using filter/where
> --
>
> Key: SPARK-12754
> URL: https://issues.apache.org/jira/browse/SPARK-12754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.5.0+
>Reporter: Jesse English
>
> The following test produces the error 
> _org.apache.spark.sql.AnalysisException: cannot resolve '(point = 
> array(0,9))' due to data type mismatch: differing types in '(point = 
> array(0,9))' (array and array)_
> This is not the case on 1.4.x, but has been introduced with 1.5+.  Is there a 
> preferred method for making this sort of arbitrarily sized array comparison?
> {code:title=test.scala}
> test("test array comparison") {
> val vectors: Vector[Row] =  Vector(
>   Row.fromTuple("id_1" -> Array(0L, 2L)),
>   Row.fromTuple("id_2" -> Array(0L, 5L)),
>   Row.fromTuple("id_3" -> Array(0L, 9L)),
>   Row.fromTuple("id_4" -> Array(1L, 0L)),
>   Row.fromTuple("id_5" -> Array(1L, 8L)),
>   Row.fromTuple("id_6" -> Array(2L, 4L)),
>   Row.fromTuple("id_7" -> Array(5L, 6L)),
>   Row.fromTuple("id_8" -> Array(6L, 2L)),
>   Row.fromTuple("id_9" -> Array(7L, 0L))
> )
> val data: RDD[Row] = sc.parallelize(vectors, 3)
> val schema = StructType(
>   StructField("id", StringType, false) ::
> StructField("point", DataTypes.createArrayType(LongType), false) ::
> Nil
> )
> val sqlContext = new SQLContext(sc)
> var dataframe = sqlContext.createDataFrame(data, schema)
> val  targetPoint:Array[Long] = Array(0L,9L)
> //This is the line where it fails
> //org.apache.spark.sql.AnalysisException: cannot resolve 
> // '(point = array(0,9))' due to data type mismatch:
> // differing types in '(point = array(0,9))' 
> // (array and array).
> val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
> assert(targetRow != null)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2016-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092425#comment-15092425
 ] 

Yin Huai commented on SPARK-12403:
--

[~lunendl] Can you try adding the database name to the FROM clause and see if that 
works around the issue (using {{Select * from openquery(SPARK,'Select * from 
yourDBName.lunentest')}})?

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2016-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092427#comment-15092427
 ] 

Yin Huai commented on SPARK-12403:
--

[~lunendl] Also, have you reported this to Simba? If there is a public page that 
tracks that issue, it would be good to post it here.

(BTW, from the error message it looks like the ODBC driver got the wrong database 
name. I am not sure whether it is a problem with the ODBC driver or with Spark 
SQL's Thrift server. We will try to investigate when we get a chance.)

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12744:
-
Assignee: Anatoliy Plastinin

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
>
> Let’s have the following JSON:
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark SQL casts int to timestamp, treating the int value as a number of seconds
> (see https://issues.apache.org/jira/browse/SPARK-11724):
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> +--------------------+
> |                  ts|
> +--------------------+
> |2016-01-10 01:37:...|
> +--------------------+
> {code}
> However, parsing the JSON with a schema gives a different result:
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> +--------------------+
> |                  ts|
> +--------------------+
> |1970-01-17 20:26:...|
> +--------------------+
> {code}
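
For illustration (an editor's sketch, not part of the original report): one way to 
keep the seconds-based interpretation when supplying a schema is to declare the field 
as {{LongType}} and apply the cast explicitly, which goes through the same 
int-to-timestamp cast as the inferred-schema example above.

{code}
// Read the epoch seconds as a plain long, then cast explicitly; the cast
// treats the value as seconds, matching the inferred-schema behaviour.
val longSchema = (new StructType).add("ts", LongType)
sqlContext.read.schema(longSchema).json(rdd)
  .select($"ts".cast(TimestampType).as("ts"))
  .show()
{code}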



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12744.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

This issue has been resolved by https://github.com/apache/spark/pull/10687.

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Let’s have the following JSON:
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark SQL casts int to timestamp, treating the int value as a number of seconds
> (see https://issues.apache.org/jira/browse/SPARK-11724):
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> +--------------------+
> |                  ts|
> +--------------------+
> |2016-01-10 01:37:...|
> +--------------------+
> {code}
> However, parsing the JSON with a schema gives a different result:
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> +--------------------+
> |                  ts|
> +--------------------+
> |1970-01-17 20:26:...|
> +--------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092441#comment-15092441
 ] 

Yin Huai commented on SPARK-12744:
--

[~antlypls] Can you add a comment to summarize the change (it will help us to 
prepare the release notes)?

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Let’s have the following JSON:
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark SQL casts int to timestamp, treating the int value as a number of seconds
> (see https://issues.apache.org/jira/browse/SPARK-11724):
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> +--------------------+
> |                  ts|
> +--------------------+
> |2016-01-10 01:37:...|
> +--------------------+
> {code}
> However, parsing the JSON with a schema gives a different result:
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> +--------------------+
> |                  ts|
> +--------------------+
> |1970-01-17 20:26:...|
> +--------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.

2016-01-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092445#comment-15092445
 ] 

Jean-Baptiste Onofré commented on SPARK-12430:
--

I think it's related to this commit:

{code}
52f5754 Marcelo Vanzin on 1/21/15 at 11:38 PM (committed by Josh Rosen on 
2/2/15 at 11:01 PM)
Make sure only owner can read / write to directories created for the job.
Whenever a directory is created by the utility method, immediately restrict
its permissions so that only the owner has access to its contents.
Signed-off-by: Josh Rosen 
{code}

As it can be checked with the extras/java8-test, I will verify.

Sorry for the delay; I'll keep you posted.

> Temporary folders do not get deleted after Task completes causing problems 
> with disk space.
> ---
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
> Environment: Ubuntu server
>Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after 
> framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as 
> Spark 1.5.1) over Mesos will not delete some temporary folders causing free 
> disk space on server to exhaust. 
> Behavior of M/R job using Spark 1.4.1 over Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/*  , 
>  */tmp/spark-#/blockmgr-#*
> - When task is completed */tmp/spark-#/* gets deleted along with 
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job):
> - Launched using spark-submit on one cluster node.
> - Following folders are created: */tmp/mesos/mesos/slaves/id** * , 
> */tmp/spark-***/ *  ,{color:red} /tmp/blockmgr-***{color}
> - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle 
> container folder {color:red} /tmp/blockmgr-***{color}
> Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several 
> GB depending on the job that ran. Over time this causes disk space to become 
> full with consequences that we all know. 
> Running a shell script would probably work but it is difficult to identify 
> folders in use by a running M/R or stale folders. I did notice similar issues 
> opened by other users marked as "resolved", but none seems to exactly match 
> the above behavior. 
> I really hope someone has insights on how to fix it.
> Thank you very much!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-11 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092455#comment-15092455
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks Nikita. I will be issuing PRs against your kafka09-integration branch 
so it can become the single source of truth until this change gets merged into 
Spark. Also, I believe the Spark community prefers discussion on PRs once they are 
filed, so you'll hear more from me there. :-)

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for backward compatibility. Users will not need 
> to change their old Spark applications when they upgrade to the new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12646) Support _HOST in kerberos principal for connecting to secure cluster

2016-01-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092460#comment-15092460
 ] 

Marcelo Vanzin commented on SPARK-12646:


Can you convince people to at least use proper credentials to launch the Spark 
jobs instead of reusing YARN's?

I'm a little wary of adding this feature just to support a broken use case. 
When running on YARN, Spark is a user application, and you're asking for Spark 
to authenticate using service principals. That's kinda wrong, even if it works.

Your code also has a huge problem in that it uses {{InetAddress.getLocalHost}}; 
even if this were a desirable feature, there's no guarantee that's the correct 
host to use at all. On multi-homed machines, for example, which should be the 
address to use when expanding the principal template?

Your application can also log in to Kerberos before launching the Spark job: call 
kinit yourself and then launch Spark without using "--principal" or 
"--keytab". Then Spark doesn't need to do anything; it just inherits the 
Kerberos ticket from your app.
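
(For illustration only, not from the original discussion: a minimal sketch of the 
programmatic equivalent of that last option, using Hadoop's {{UserGroupInformation}} 
API; the principal and keytab path below are placeholders.)

{code}
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

// Log in up front with the application's own principal; the Spark job
// launched afterwards inherits this Kerberos login instead of needing
// --principal/--keytab.
UserGroupInformation.loginUserFromKeytab(
  "myapp/my-host.example.com@EXAMPLE.COM",  // placeholder principal
  "/etc/security/keytabs/myapp.keytab")     // placeholder keytab path

val sc = new SparkContext(new SparkConf().setAppName("my-app"))
{code}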

> Support _HOST in kerberos principal for connecting to secure cluster
> 
>
> Key: SPARK-12646
> URL: https://issues.apache.org/jira/browse/SPARK-12646
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Hari Krishna Dara
>Priority: Minor
>  Labels: security
>
> Hadoop supports _HOST as a token that is dynamically replaced with the actual 
> hostname at the time the kerberos authentication is done. This is supported 
> in many hadoop stacks including YARN. When configuring Spark to connect to 
> secure cluster (e.g., yarn-cluster or yarn-client as master), it would be 
> natural to extend support for this token to Spark as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located behind NAT

2016-01-11 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092481#comment-15092481
 ] 

Alan Braithwaite commented on SPARK-4389:
-

So is there any hope for running Spark behind a transparent proxy, then? What 
is the preferred method for running a spark-master in an environment where 
things get dynamically scheduled (Mesos+Marathon, Kubernetes, etc.)?

> Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located 
> behind NAT
> -
>
> Key: SPARK-4389
> URL: https://issues.apache.org/jira/browse/SPARK-4389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Priority: Minor
>
> We should set {{akka.remote.netty.tcp.bind-hostname="0.0.0.0"}} in our Akka 
> configuration so that Spark drivers can be located behind NATs / work with 
> weird DNS setups.
> This is blocked by upgrading our Akka version, since this configuration is 
> not present Akka 2.3.4.  There might be a different approach / workaround 
> that works on our current Akka version, though.
> EDIT: this is blocked by Akka 2.4, since this feature is only available in 
> the 2.4 snapshot release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Michael Allman (JIRA)
Michael Allman created SPARK-12755:
--

 Summary: Spark may attempt to rebuild application UI before 
finishing writing the event logs in possible race condition
 Key: SPARK-12755
 URL: https://issues.apache.org/jira/browse/SPARK-12755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Michael Allman
Priority: Minor


As reported in SPARK-6950, it appears that sometimes the standalone master 
attempts to build an application's historical UI before closing the app's event 
log. This is still an issue for us in 1.5.2+, and I believe I've found the 
underlying cause.

When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727

and then stops the event logger:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727

Though it is difficult to follow the chain of events, one of the sequelae of 
stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
called. This method looks for the application's event logs, and its behavior 
varies based on the existence of an {{.inprogress}} file suffix. In particular, 
a warning is logged if this suffix exists:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935

After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
stops the event logger:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736

This renames the event log, dropping the {{.inprogress}} file sequence.

As such, a race condition exists where the master may attempt to process the 
application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092489#comment-15092489
 ] 

Michael Allman commented on SPARK-12755:


I'm going to put together a PR that simply reorders the call to stop the event 
logger so that it comes before the call to stop the DAG scheduler.
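
A minimal sketch of the proposed ordering inside {{SparkContext.stop()}} (field 
names simplified for illustration; this is not the actual patch):

{code}
// Stop the event logger first so the log is renamed away from its
// .inprogress name, then stop the DAG scheduler, whose shutdown is what
// eventually leads the master to rebuild the application UI.
eventLogger.foreach(_.stop())
dagScheduler.stop()
{code}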

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file sequence.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092509#comment-15092509
 ] 

Apache Spark commented on SPARK-6950:
-

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/10700

> Spark master UI believes some applications are in progress when they are 
> actually completed
> ---
>
> Key: SPARK-6950
> URL: https://issues.apache.org/jira/browse/SPARK-6950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
> Fix For: 1.3.1
>
>
> In Spark 1.2.x, I was able to set my spark event log directory to be a 
> different location from the default, and after the job finishes, I can replay 
> the UI by clicking on the appropriate link under "Completed Applications".
> Now, on a non-deterministic basis (but seems to happen most of the time), 
> when I click on the link under "Completed Applications", I instead get a 
> webpage that says:
> Application history not found (app-20150415052927-0014)
> Application myApp is still in progress.
> I am able to view the application's UI using the Spark history server, so 
> something regressed in the Spark master code between 1.2 and 1.3, but that 
> regression does not apply in the history server use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092554#comment-15092554
 ] 

Apache Spark commented on SPARK-7831:
-

User 'nraychaudhuri' has created a pull request for this issue:
https://github.com/apache/spark/pull/10701

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --
>
> Key: SPARK-7831
> URL: https://issues.apache.org/jira/browse/SPARK-7831
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>Reporter: Luc Bourlier
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be 
> running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher 
> registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the 
> application is correctly terminated locally, but the framework is still 
> listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is 
> stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7831:
---

Assignee: (was: Apache Spark)

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --
>
> Key: SPARK-7831
> URL: https://issues.apache.org/jira/browse/SPARK-7831
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>Reporter: Luc Bourlier
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be 
> running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher 
> registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the 
> application is correctly terminated locally, but the framework is still 
> listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is 
> stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7831:
---

Assignee: Apache Spark

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --
>
> Key: SPARK-7831
> URL: https://issues.apache.org/jira/browse/SPARK-7831
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>Reporter: Luc Bourlier
>Assignee: Apache Spark
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be 
> running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher 
> registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the 
> application is correctly terminated locally, but the framework is still 
> listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is 
> stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092559#comment-15092559
 ] 

Apache Spark commented on SPARK-12732:
--

User 'iyounus' has created a pull request for this issue:
https://github.com/apache/spark/pull/10702

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Priority: Minor
>
> If the target variable is constant, then the linear regression must check if 
> the fitIntercept is true or false, and handle these two cases separately.
> If the fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if fitIntercept is false, then the model should still train.
> Currently, LinearRegression handles both cases in the same way: it doesn't 
> train the model and sets the intercept equal to the mean of y, which means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.
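
A hypothetical reproduction sketch of the constant-label case (the data and names 
below are made up for illustration; API as in the 1.6-era spark.ml package):

{code}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.linalg.Vectors

// Constant label: every y is 1.0.
val df = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(1.0)),
  (1.0, Vectors.dense(2.0)),
  (1.0, Vectors.dense(3.0))
)).toDF("label", "features")

// With fitIntercept=false the fit is forced through the origin, so the
// intercept should be reported as 0.0 and the coefficients should be trained.
val model = new LinearRegression().setFitIntercept(false).fit(df)
println(s"intercept=${model.intercept}, coefficients=${model.coefficients}")
{code}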



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12732:


Assignee: (was: Apache Spark)

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Priority: Minor
>
> If the target variable is constant, then the linear regression must check if 
> the fitIntercept is true or false, and handle these two cases separately.
> If the fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if fitIntercept is false, then the model should still train.
> Currently, LinearRegression handles both cases in the same way: it doesn't 
> train the model and sets the intercept equal to the mean of y, which means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12732:


Assignee: Apache Spark

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Assignee: Apache Spark
>Priority: Minor
>
> If the target variable is constant, then the linear regression must check if 
> the fitIntercept is true or false, and handle these two cases separately.
> If the fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if fitIntercept is false, then the model should still train.
> Currently, LinearRegression handles both cases in the same way: it doesn't 
> train the model and sets the intercept equal to the mean of y, which means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12714) Transforming Dataset with sequences of case classes to RDD causes Task Not Serializable exception

2016-01-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092591#comment-15092591
 ] 

Michael Armbrust commented on SPARK-12714:
--

Would you be able to test with {{branch-1.6}}?  I backported a bunch of fixes 
after the release.

> Transforming Dataset with sequences of case classes to RDD causes Task Not 
> Serializable exception
> -
>
> Key: SPARK-12714
> URL: https://issues.apache.org/jira/browse/SPARK-12714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: linux 3.13.0-24-generic, scala 2.10.6
>Reporter: James Eastwood
>
> Attempting to transform a Dataset of a case class containing a nested 
> sequence of case classes causes an exception to be thrown: 
> `org.apache.spark.SparkException: Task not serializable`.
> Here is a minimum repro:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> case class Top(a: String, nested: Array[Nested])
> case class Nested(b: String)
> object scratch {
>   def main ( args: Array[String] ) {
> lazy val sparkConf = new 
> SparkConf().setAppName("scratch").setMaster("local[1]")
> lazy val sparkContext = new SparkContext(sparkConf)
> lazy val sqlContext = new SQLContext(sparkContext)
> val input = List(
>   """{ "a": "123", "nested": [{ "b": "123" }] }"""
> )
> import sqlContext.implicits._
> val ds = sqlContext.read.json(sparkContext.parallelize(input)).as[Top]
> ds.rdd.foreach(println)
> sparkContext.stop()
>   }
> }
> {code}
> {code}
> scalaVersion := "2.10.6"
> lazy val sparkVersion = "1.6.0"
> libraryDependencies ++= List(
>   "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
> )
> {code}
> Full stack trace:
> {code}
> [error] (run-main-0) org.apache.spark.SparkException: Task not serializable
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at org.apache.spark.sql.Dataset.rdd(Dataset.scala:166)
>   at scratch$.main(scratch.scala:26)
>   at scratch.main(scratch.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: java.io.NotSerializableException: 
> scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$, value: package 
> )
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, )
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$TypeRef$$anon$6, Nested)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2,
>  name: elementType$1, type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2,
>  )
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2$$anonfun$apply$1,
>  name: $outer, type: class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2)
>   - object (class 
> 

[jira] [Created] (SPARK-12756) use hash expression in Exchange

2016-01-11 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12756:
---

 Summary: use hash expression in Exchange
 Key: SPARK-12756
 URL: https://issues.apache.org/jira/browse/SPARK-12756
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12740) grouping()/grouping_id() should work with having and order by

2016-01-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092602#comment-15092602
 ] 

Davies Liu commented on SPARK-12740:


grouping() and grouping_id() will be introduced by https://github.com/apache/spark/pull/10677.

> grouping()/grouping_id() should work with having and order by
> -
>
> Key: SPARK-12740
> URL: https://issues.apache.org/jira/browse/SPARK-12740
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>
> The following query should work
> {code}
> select a, b, sum(c) from t group by cube(a, b) having grouping(a) = 0 order 
> by grouping_id(a, b)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12687) Support from clause surrounded by `()`

2016-01-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12687:
---
Assignee: Liang-Chi Hsieh

> Support from clause surrounded by `()`
> --
>
> Key: SPARK-12687
> URL: https://issues.apache.org/jira/browse/SPARK-12687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.0
>
>
> This query can't be parsed under Hive parser:
> {code}
> (select * from t1) union (select * from t2)
> {code}
> also this one:
> {code}
> select * from ((select * from t1) union (select * from t2)) t
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12756) use hash expression in Exchange

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12756:


Assignee: (was: Apache Spark)

> use hash expression in Exchange
> ---
>
> Key: SPARK-12756
> URL: https://issues.apache.org/jira/browse/SPARK-12756
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12756) use hash expression in Exchange

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12756:


Assignee: Apache Spark

> use hash expression in Exchange
> ---
>
> Key: SPARK-12756
> URL: https://issues.apache.org/jira/browse/SPARK-12756
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12756) use hash expression in Exchange

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092603#comment-15092603
 ] 

Apache Spark commented on SPARK-12756:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10703

> use hash expression in Exchange
> ---
>
> Key: SPARK-12756
> URL: https://issues.apache.org/jira/browse/SPARK-12756
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12700) SortMergeJoin and BroadcastHashJoin should support condition

2016-01-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12700:
--

Assignee: Davies Liu

> SortMergeJoin and BroadcastHashJoin should support condition
> 
>
> Key: SPARK-12700
> URL: https://issues.apache.org/jira/browse/SPARK-12700
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we handle extra join conditions with a Filter following SortMergeJoin 
> or BroadcastHashJoin; the result projection of the join could be very expensive if 
> the joins generate lots of rows (most of which could be eliminated by the condition).
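
For illustration (an assumed example, not from the ticket): a join of this shape 
currently plans the non-equi predicate as a Filter above the join, after the join 
has already produced and projected its full output.

{code}
// The equi-join key drives the SortMergeJoin/BroadcastHashJoin, while the
// extra range predicate is applied by a separate Filter over the joined rows.
// ordersDf and customersDf are hypothetical DataFrames.
val joined = ordersDf.join(customersDf,
  ordersDf("customerId") === customersDf("id") &&
    ordersDf("amount") > customersDf("creditLimit"))
{code}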



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12734) Fix Netty exclusions and use Maven Enforcer to prevent bug from being reintroduced

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092616#comment-15092616
 ] 

Apache Spark commented on SPARK-12734:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10704

> Fix Netty exclusions and use Maven Enforcer to prevent bug from being 
> reintroduced
> --
>
> Key: SPARK-12734
> URL: https://issues.apache.org/jira/browse/SPARK-12734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
>
> Netty classes are published under artifacts with different names, so our 
> build needs to exclude the {{org.jboss.netty:netty}} versions of the Netty 
> artifact. However, our existing exclusions were incomplete, leading to 
> situations where duplicate Netty classes would wind up on the classpath and 
> cause compile errors (or worse).
> We should fix this and should also start using Maven Enforcer's dependency 
> banning mechanisms to prevent this problem from ever being reintroduced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092620#comment-15092620
 ] 

Sean Owen commented on SPARK-12755:
---

Is this the same as SPARK-6950? If you have more detail here, you should reopen 
it rather than make a new JIRA.

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file sequence.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-01-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12757:
--

 Summary: Use reference counting to prevent blocks from being 
evicted during reads
 Key: SPARK-12757
 URL: https://issues.apache.org/jira/browse/SPARK-12757
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Reporter: Josh Rosen
Assignee: Josh Rosen


As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
prevent pages / blocks from being evicted while they are being read. With 
on-heap objects, evicting a block while it is being read merely leads to 
memory-accounting problems (because we assume that an evicted block is a 
candidate for garbage-collection, which will not be true during a read), but 
with off-heap memory this will lead to either data corruption or segmentation 
faults.

To address this, we should add a reference-counting mechanism to track which 
blocks/pages are being read in order to prevent them from being evicted 
prematurely. I propose to do this in two phases: first, add a safe, 
conservative approach in which all BlockManager.get*() calls implicitly 
increment the reference count of blocks and where tasks' references are 
automatically freed upon task completion. This will be correct but may have 
adverse performance impacts because it will prevent legitimate block evictions. 
In phase two, we should incrementally add release() calls in order to fix the 
eviction of unreferenced blocks. The latter change may need to touch many 
different components, which is why I propose to do it separately in order to 
make the changes easier to reason about and review.
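
As a rough illustration of the phase-one behavior (implicit pin on every read, bulk
release when the task completes), here is a small self-contained Scala sketch. It is
not the BlockManager API; the names {{PinCounts}}, {{pin}}, {{release}} and
{{releaseAllForTask}} are all hypothetical and only meant to show the bookkeeping
the proposal relies on.

{code}
import scala.collection.mutable

// Hypothetical bookkeeping, not Spark's BlockManager: a block may only be evicted
// while no task holds a pin on it.
final case class BlockId(name: String)

class PinCounts {
  // (block, task) -> number of outstanding reads held by that task
  private val pins = mutable.Map.empty[(BlockId, Long), Int].withDefaultValue(0)

  // Called from every get*()-style read: the block cannot be evicted while pinned.
  def pin(block: BlockId, taskId: Long): Unit = synchronized {
    pins((block, taskId)) += 1
  }

  // Phase two would add explicit release() calls at the sites that finish reading.
  def release(block: BlockId, taskId: Long): Unit = synchronized {
    val key = (block, taskId)
    val n = pins(key) - 1
    if (n <= 0) pins.remove(key) else pins(key) = n
  }

  // Conservative phase-one safety net: drop every pin a task still holds at completion.
  def releaseAllForTask(taskId: Long): Unit = synchronized {
    pins.keys.filter(_._2 == taskId).toList.foreach(pins.remove)
  }

  // The eviction policy consults this before evicting a block.
  def isPinned(block: BlockId): Boolean = synchronized {
    pins.keys.exists(_._1 == block)
  }
}

object PinCountsDemo extends App {
  val counts = new PinCounts
  val rdd0 = BlockId("rdd_0_0")
  counts.pin(rdd0, taskId = 1L)   // implicit pin from a read
  assert(counts.isPinned(rdd0))   // eviction must skip this block for now
  counts.releaseAllForTask(1L)    // task finished: free its references
  assert(!counts.isPinned(rdd0))  // now a legitimate eviction candidate again
}
{code}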



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092644#comment-15092644
 ] 

Michael Allman commented on SPARK-12755:


I think they have the same root cause. If I reopen SPARK-6950, where should I 
put what I put in this ticket's description?

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file suffix.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12755:


Assignee: (was: Apache Spark)

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file suffix.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092645#comment-15092645
 ] 

Apache Spark commented on SPARK-12755:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/10700

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file suffix.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12755:


Assignee: Apache Spark

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Assignee: Apache Spark
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} file suffix.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092646#comment-15092646
 ] 

Apache Spark commented on SPARK-12757:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10705

> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12757:


Assignee: Apache Spark  (was: Josh Rosen)

> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12757:


Assignee: Josh Rosen  (was: Apache Spark)

> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12543) Support subquery in select/where/having

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12543:


Assignee: Apache Spark  (was: Davies Liu)

> Support subquery in select/where/having
> ---
>
> Key: SPARK-12543
> URL: https://issues.apache.org/jira/browse/SPARK-12543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12543) Support subquery in select/where/having

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092653#comment-15092653
 ] 

Apache Spark commented on SPARK-12543:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10706

> Support subquery in select/where/having
> ---
>
> Key: SPARK-12543
> URL: https://issues.apache.org/jira/browse/SPARK-12543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
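
For context, a sketch of the query shapes this feature is about, assuming a
spark-shell session where {{sqlContext}} is available; the {{customers}} and
{{orders}} tables and all column names are invented for illustration.

{code}
// Illustrative only: scalar and IN subqueries in SELECT, WHERE and HAVING.
val scalarInSelect = sqlContext.sql(
  "SELECT name, (SELECT max(amount) FROM orders) AS max_amount FROM customers")

val inWhere = sqlContext.sql(
  "SELECT name FROM customers WHERE id IN (SELECT customer_id FROM orders)")

val inHaving = sqlContext.sql(
  "SELECT customer_id, sum(amount) AS total FROM orders " +
  "GROUP BY customer_id HAVING sum(amount) > (SELECT avg(amount) FROM orders)")
{code}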




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12543) Support subquery in select/where/having

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12543:


Assignee: Davies Liu  (was: Apache Spark)

> Support subquery in select/where/having
> ---
>
> Key: SPARK-12543
> URL: https://issues.apache.org/jira/browse/SPARK-12543
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Anatoliy Plastinin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092663#comment-15092663
 ] 

Anatoliy Plastinin commented on SPARK-12744:


[~yhuai] How about: _"The semantics of reading a JSON integer as a timestamp (when 
the timestamp type is explicitly defined by the schema) have changed: the integer 
value is now treated as a number of seconds instead of milliseconds."_?

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Let’s take the following JSON:
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark SQL casts an int to a timestamp by treating the int value as a number of seconds:
> https://issues.apache.org/jira/browse/SPARK-11724
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> +--------------------+
> |                  ts|
> +--------------------+
> |2016-01-10 01:37:...|
> +--------------------+
> {code}
> However, parsing the JSON with an explicit schema gives a different result:
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> +--------------------+
> |                  ts|
> +--------------------+
> |1970-01-17 20:26:...|
> +--------------------+
> {code}
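
The two interpretations can be checked without Spark at all; a quick Scala sketch
(the object name is mine, instants printed in UTC, whereas the show() output above
is presumably rendered in the session time zone):

{code}
import java.time.Instant

// Plain-JVM check of the two readings of 1452386229 shown above.
object JsonTimestampUnits extends App {
  val raw = 1452386229L
  println(Instant.ofEpochSecond(raw)) // 2016-01-10T00:37:09Z (seconds: matches the cast result)
  println(Instant.ofEpochMilli(raw))  // 1970-01-17T19:26:26.229Z (milliseconds: matches the schema-based read)
}
{code}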



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


