[jira] [Resolved] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.

2015-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9570.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8968
[https://github.com/apache/spark/pull/8968]

> Consistent recommendation for submitting spark apps to YARN, -master yarn 
> --deploy-mode x vs -master yarn-x'.
> -
>
> Key: SPARK-9570
> URL: https://issues.apache.org/jira/browse/SPARK-9570
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit, YARN
>Affects Versions: 1.4.1
>Reporter: Neelesh Srinivas Salian
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> There are still some inconsistencies in the documentation regarding how 
> applications are submitted to YARN.
> SPARK-3629 was meant to correct this, but 
> http://spark.apache.org/docs/latest/submitting-applications.html#master-urls
> still lists yarn-client and yarn-cluster, as opposed to the norm of using 
> --master yarn and --deploy-mode cluster / client.
> Need to change this appropriately (if needed) to avoid confusion:
> https://spark.apache.org/docs/latest/running-on-yarn.html
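
For concreteness, the two styles being compared look roughly like this as 
spark-submit invocations (the application class and jar below are made-up 
placeholders, not taken from the docs):

{noformat}
# Recommended form: master and deploy mode given separately.
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class com.example.MyApp myapp.jar

# Legacy shorthand still shown on the master-urls page.
./bin/spark-submit --master yarn-cluster \
  --class com.example.MyApp myapp.jar
{noformat}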






[jira] [Assigned] (SPARK-9570) Consistent recommendation for submitting spark apps to YARN, -master yarn --deploy-mode x vs -master yarn-x'.

2015-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-9570:


Assignee: Sean Owen

> Consistent recommendation for submitting spark apps to YARN, -master yarn 
> --deploy-mode x vs -master yarn-x'.
> -
>
> Key: SPARK-9570
> URL: https://issues.apache.org/jira/browse/SPARK-9570
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit, YARN
>Affects Versions: 1.4.1
>Reporter: Neelesh Srinivas Salian
>Assignee: Sean Owen
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> There are still some inconsistencies in the documentation regarding how 
> applications are submitted to YARN.
> SPARK-3629 was meant to correct this, but 
> http://spark.apache.org/docs/latest/submitting-applications.html#master-urls
> still lists yarn-client and yarn-cluster, as opposed to the norm of using 
> --master yarn and --deploy-mode cluster / client.
> Need to change this appropriately (if needed) to avoid confusion:
> https://spark.apache.org/docs/latest/running-on-yarn.html






[jira] [Resolved] (SPARK-10889) Upgrade Kinesis Client Library

2015-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10889.
---
   Resolution: Fixed
 Assignee: Avrohom Katz
Fix Version/s: 1.6.0
   1.5.2

It does in fact look like a maintenance release of the client library, with doc 
updates and one new metric. Hence it's OK for a Spark maintenance release IMHO, so 
I put it into 1.5.2 as well. It doesn't merge cleanly into 1.4, so I didn't merge 
it back to 1.4.

> Upgrade Kinesis Client Library
> --
>
> Key: SPARK-10889
> URL: https://issues.apache.org/jira/browse/SPARK-10889
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Avrohom Katz
>Assignee: Avrohom Katz
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> Kinesis Client Library added a custom cloudwatch metric in 1.3.0 called 
> MillisBehindLatest. This is very important for capacity planning and alerting.






[jira] [Commented] (SPARK-10798) JsonMappingException with Spark Context Parallelize

2015-10-04 Thread Dev Lakhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942651#comment-14942651
 ] 

Dev Lakhani commented on SPARK-10798:
-

Hi Miao

I will create a github project/fork for this to give you the full sample soon.

Thanks
Dev

> JsonMappingException with Spark Context Parallelize
> ---
>
> Key: SPARK-10798
> URL: https://issues.apache.org/jira/browse/SPARK-10798
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: Linux, Java 1.8.45
>Reporter: Dev Lakhani
>
> When trying to create an RDD of Rows using a JavaSparkContext, if I serialize 
> the rows with Kryo first, the SparkContext fails.
> byte[] data= Kryo.serialize(List)
> List fromKryoRows=Kryo.unserialize(data)
> List rows= new Vector(); //using a new set of data.
> rows.add(RowFactory.create("test"));
> javaSparkContext.parallelize(rows);
> OR
> javaSparkContext.parallelize(fromKryoRows); //using deserialized rows
> I get :
> com.fasterxml.jackson.databind.JsonMappingException: (None,None) (of class 
> scala.Tuple2) (through reference chain: 
> org.apache.spark.rdd.RDDOperationScope["parent"])
>at 
> com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:210)
>at 
> com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(JsonMappingException.java:177)
>at 
> com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(StdSerializer.java:187)
>at 
> com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:647)
>at 
> com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:152)
>at 
> com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
>at 
> com.fasterxml.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:2881)
>at 
> com.fasterxml.jackson.databind.ObjectMapper.writeValueAsString(ObjectMapper.java:2338)
>at 
> org.apache.spark.rdd.RDDOperationScope.toJson(RDDOperationScope.scala:50)
>at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:141)
>at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>at 
> org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>at 
> org.apache.spark.SparkContext.parallelize(SparkContext.scala:714)
>at 
> org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:145)
>at 
> org.apache.spark.api.java.JavaSparkContext.parallelize(JavaSparkContext.scala:157)
>...
> Caused by: scala.MatchError: (None,None) (of class scala.Tuple2)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply$mcV$sp(OptionSerializerModule.scala:32)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer$$anonfun$serialize$1.apply(OptionSerializerModule.scala:32)
>at scala.Option.getOrElse(Option.scala:120)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:31)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionSerializer.serialize(OptionSerializerModule.scala:22)
>at 
> com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:505)
>at 
> com.fasterxml.jackson.module.scala.ser.OptionPropertyWriter.serializeAsField(OptionSerializerModule.scala:128)
>at 
> com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:639)
>... 19 more
> I've tried updating jackson-module-scala to 2.6.1, but I see the same issue. This 
> happens in local mode with Java 1.8.0_45. I searched the web and this JIRA for 
> similar issues but found nothing of interest.
>  
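
For reference, a minimal Scala sketch of the reported reproduction; it assumes 
local mode and uses Spark's own KryoSerializer in place of the 
Kryo.serialize/Kryo.unserialize pseudocode quoted above:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.{Row, RowFactory}

val conf = new SparkConf().setMaster("local[*]").setAppName("SPARK-10798-repro")
val sc = new SparkContext(conf)

// Rows to round-trip through Kryo, as in the pseudocode above.
val rows: Seq[Row] = Seq(RowFactory.create("test"))

val ser = new KryoSerializer(conf).newInstance()
val bytes = ser.serialize(rows)                      // ~ byte[] data = Kryo.serialize(List)
val fromKryoRows = ser.deserialize[Seq[Row]](bytes)  // ~ List fromKryoRows = Kryo.unserialize(data)

// Per the report, either of these calls then fails with the JsonMappingException above.
sc.parallelize(rows)
sc.parallelize(fromKryoRows)
{code}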






[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2015-10-04 Thread Ofer Mendelevitch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942671#comment-14942671
 ] 

Ofer Mendelevitch commented on SPARK-10892:
---

Thanks Yin - confirmed this is still an issue with the latest 1.5 branch.

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of them has a column named "value", 
> so the columns are renamed after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> The output is:
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|0|2012|
> |20121020| 20|  PRCP|   10|USW00023272|0|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMIN|   10|USW00023272|  139|2012|
> |20121002|  2|  TMIN|   10|USW00023272|  178|2012|
> |20121003|  3|  TMIN|   10|USW00023272|  144|2012|
> |20121004|  4|  TMIN|   10|USW00023272|  144|2012|
> |20121005|  5|  TMIN|   10|USW00023272|  139|2012|
> |20121006|  6|  TMIN|   10|USW00023272|  128|2012|
> |20121007|  7|  TMIN|   10|USW00023272|  122|2012|
> |20121008|  8|  TMIN|   10|USW00023272|  122|2012|
> |20121009|  9|  TMIN|   10|USW00023272|  139|2012|
> |20121010| 10|  TMIN|   10|USW00023272|  128|2012|
> |20121011| 11|  TMIN|   10|USW00023272|  122|2012|
> |20121012| 12|  TMIN|   10|USW00023272|  117|2012|
> |20121013| 13|  TMIN|   10|USW00023272|  122|2012|
> |20121014| 14|  TMIN|   10|USW00023272|  128|2012|
> |20121015| 15|  TMIN|   10|USW00023272|  128|2012|
> |20121016| 16|  TMIN|   10|USW00023272|  156|2012|
> |20121017| 17|  TMIN|   10|USW00023272|  139|2012|
> |20121018| 18|  TMIN|   10|USW00023272|  161|2012|
> |20121019| 19|  TMIN|   10|USW00023272|  133|2012|
> |20121020| 20|  TMIN|   10|USW00023272|  122|2012|
> ++---+--+-+---+-++
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  TMAX|   10|USW00023272|  322|2012|
> |20121002|  2|  TMAX|   10|USW00023272|  344|2012|
> |20121003|  3|  TMAX|   10|USW00023272|  222|2012|
> |20121004|  4|  TMAX|   10|USW00023272|  189|2012|
> |20121005|  5|  TMAX|   10|USW00023272|  194|2012|
> |201

[jira] [Created] (SPARK-10919) Assosiation rules class should return the support of each rule

2015-10-04 Thread Tofigh (JIRA)
Tofigh created SPARK-10919:
--

 Summary: Assosiation rules class should return the support of each 
rule
 Key: SPARK-10919
 URL: https://issues.apache.org/jira/browse/SPARK-10919
 Project: Spark
  Issue Type: Improvement
Reporter: Tofigh
Priority: Minor


The current implementation of Association rule does not return the frequency of 
appearance of each rule. This piece of information is essential for 
implementing functional dependency on the output of AR.






[jira] [Updated] (SPARK-10919) Assosiation rules class should return the support of each rule

2015-10-04 Thread Tofigh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tofigh updated SPARK-10919:
---
Description: 
The current implementation of Association rule does not return the frequency of 
appearance of each rule. This piece of information is essential for 
implementing functional dependency on the output of AR. In order to return the 
frequency (support) of each rule,   freqUnion: Double,
 and  freqAntecedent: Double should be:  val freqUnion: Double, val 
freqAntecedent: Double

  was:The current implementation of Association rule does not return the 
frequency of appearance of each rule. This piece of information is essential 
for implementing functional dependency on the output of AR.


> Assosiation rules class should return the support of each rule
> --
>
> Key: SPARK-10919
> URL: https://issues.apache.org/jira/browse/SPARK-10919
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tofigh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current implementation of Association rule does not return the frequency 
> of appearance of each rule. This piece of information is essential for 
> implementing functional dependency on the output of AR. In order to return 
> the frequency (support) of each rule,   freqUnion: Double,
>  and  freqAntecedent: Double should be:  val freqUnion: Double, val 
> freqAntecedent: Double
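
In other words, a sketch of the proposed change (a simplified stand-in for the 
spark.mllib AssociationRules.Rule class described above, not the actual code):

{code}
// Simplified sketch only. Promoting the two frequencies to vals lets callers
// recover each rule's support instead of only its confidence.
class Rule[Item](
    val antecedent: Array[Item],
    val consequent: Array[Item],
    val freqUnion: Double,        // proposed: currently a plain constructor parameter
    val freqAntecedent: Double) { // proposed: currently a plain constructor parameter
  def confidence: Double = freqUnion / freqAntecedent
  // With freqUnion exposed, a caller can compute support as
  // freqUnion / totalTransactionCount (totalTransactionCount is hypothetical here).
}
{code}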






[jira] [Updated] (SPARK-10919) Assosiation rules class should return the support of each rule

2015-10-04 Thread Tofigh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tofigh updated SPARK-10919:
---
Description: The current implementation of Association rule does not return 
the frequency of appearance of each rule. This piece of information is 
essential for implementing functional dependency on the output of AR. In order 
to return the frequency (support) of each rule,   freqUnion: Double, and  
freqAntecedent: Double should be:  val freqUnion: Double, val freqAntecedent: 
Double  (was: The current implementation of Association rule does not return 
the frequency of appearance of each rule. This piece of information is 
essential for implementing functional dependency on the output of AR. In order 
to return the frequency (support) of each rule,   freqUnion: Double,
 and  freqAntecedent: Double should be:  val freqUnion: Double, val 
freqAntecedent: Double)

> Assosiation rules class should return the support of each rule
> --
>
> Key: SPARK-10919
> URL: https://issues.apache.org/jira/browse/SPARK-10919
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tofigh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current implementation of Association rule does not return the frequency 
> of appearance of each rule. This piece of information is essential for 
> implementing functional dependency on the output of AR. In order to return 
> the frequency (support) of each rule,   freqUnion: Double, and  
> freqAntecedent: Double should be:  val freqUnion: Double, val freqAntecedent: 
> Double






[jira] [Created] (SPARK-10920) another constructor for FPGrowth algorithm to support the absolute value for support

2015-10-04 Thread Tofigh (JIRA)
Tofigh created SPARK-10920:
--

 Summary: another constructor for FPGrowth algorithm to support the 
absolute value for support
 Key: SPARK-10920
 URL: https://issues.apache.org/jira/browse/SPARK-10920
 Project: Spark
  Issue Type: Improvement
Reporter: Tofigh
Priority: Minor


The current implementation only accepts the support as a percentage and then 
counts the number of samples again to convert it back to an absolute value. It 
would be better to have another constructor that directly takes the absolute value 
of the support.
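
A rough sketch of the conversion callers have to do today with the existing 
fraction-based API (the helper and its names are made up for illustration; this 
is not the proposed constructor itself):

{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Sketch: a caller who thinks in absolute counts must first convert to a
// fraction (an extra pass over the data here), only for FPGrowth to count the
// transactions again internally to turn the fraction back into a threshold.
def fpGrowthWithAbsoluteSupport(
    transactions: RDD[Array[String]],
    absoluteMinSupport: Long): FPGrowth = {
  val total = transactions.count()
  new FPGrowth().setMinSupport(absoluteMinSupport.toDouble / total)
}
{code}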






[jira] [Updated] (SPARK-10920) another constructor for FPGrowth algorithm to support the absolute value for support

2015-10-04 Thread Tofigh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tofigh updated SPARK-10920:
---
Description: The current implementation takes the support value in 
percentage and then counts the number of samples again to convert it back to 
absolute value. It is better to have another constructor that directly takes 
the absolute value for the support.  (was: The current implementation only 
accepts the support in percentage and then count the number of samples again 
and convert it back to the absolute value. It is better to have another 
constructor that directly takes the absolute value of  support.)

> another constructor for FPGrowth algorithm to support the absolute value for 
> support
> 
>
> Key: SPARK-10920
> URL: https://issues.apache.org/jira/browse/SPARK-10920
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tofigh
>Priority: Minor
>
> The current implementation takes the support value in percentage and then 
> counts the number of samples again to convert it back to absolute value. It 
> is better to have another constructor that directly takes the absolute value 
> for the support.






[jira] [Updated] (SPARK-10919) Assosiation rules class should return the support of each rule

2015-10-04 Thread Tofigh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tofigh updated SPARK-10919:
---
Description: The current implementation of Association rule does not return 
the frequency of appearance of each rule. This piece of information is 
essential for implementing functional dependency on top of the AR. In order to 
return the frequency (support) of each rule,   freqUnion: Double, and  
freqAntecedent: Double should be:  val freqUnion: Double, val freqAntecedent: 
Double  (was: The current implementation of Association rule does not return 
the frequency of appearance of each rule. This piece of information is 
essential for implementing functional dependency on the output of AR. In order 
to return the frequency (support) of each rule,   freqUnion: Double, and  
freqAntecedent: Double should be:  val freqUnion: Double, val freqAntecedent: 
Double)

> Assosiation rules class should return the support of each rule
> --
>
> Key: SPARK-10919
> URL: https://issues.apache.org/jira/browse/SPARK-10919
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tofigh
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The current implementation of Association rule does not return the frequency 
> of appearance of each rule. This piece of information is essential for 
> implementing functional dependency on top of the AR. In order to return the 
> frequency (support) of each rule,   freqUnion: Double, and  freqAntecedent: 
> Double should be:  val freqUnion: Double, val freqAntecedent: Double






[jira] [Created] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-10921:
---

 Summary: Completely remove the use of 
SparkContext.preferredNodeLocationData
 Key: SPARK-10921
 URL: https://issues.apache.org/jira/browse/SPARK-10921
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.5.1
Reporter: Jacek Laskowski
Priority: Minor


SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} yet 
the code makes it less obvious as it says (see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):

{code}
  // This is used only by YARN for now, but should be relevant to other cluster 
types (Mesos,
  // etc) too. This is typically generated from 
InputFormatInfo.computePreferredLocations. It
  // contains a map from hostname to a list of input format splits on the host.
  private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = 
Map()
{code}

It turns out that there are places where the initialization does still take 
place, which only adds to the confusion.

When you search for the use of {{SparkContext.preferredNodeLocationData}},
you'll find 3 places - one constructor marked {{@deprecated}}, the other with
{{logWarning}} telling us that _"Passing in preferred locations has no
effect at all, see SPARK-8949"_, and in
{{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.

There is no consistent approach to deal with it given it's no longer used in 
theory.

[org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
 method
caught my eye and I found that it does the following in
client.register:

{code}
if (sc != null) sc.preferredNodeLocationData else Map()
{code}

However, {{client.register}} [ignores the input parameter 
completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
 but the scaladoc says (note {{preferredNodeLocations}} param):

{code}
  /**
   * Registers the application master with the RM.
   *
   * @param conf The Yarn configuration.
   * @param sparkConf The Spark configuration.
   * @param preferredNodeLocations Map with hints about where to allocate 
containers.
   * @param uiAddress Address of the SparkUI.
   * @param uiHistoryAddress Address of the application on the History Server.
   */
{code}






[jira] [Commented] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942688#comment-14942688
 ] 

Sean Owen commented on SPARK-10921:
---

I believe that any cleanup that doesn't change the binary signature -- which 
MiMa will detect -- is OK. The spirit of SPARK-8949 seems to be to remove this. 
This may entail removing that unused field, other private members and args, 
deprecating public methods that take the param, and updating docs accordingly.

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} 
> yet the code makes it less obvious as it says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other 
> cluster types (Mesos,
>   // etc) too. This is typically generated from 
> InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the 
> host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = 
> Map()
> {code}
> It turns out that there are places where the initialization does take place 
> that only adds up to the confusion.
> When you search for the use of {{SparkContext.preferredNodeLocationData}},
> you'll find 3 places - one constructor marked {{@deprecated}}, the other with
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and in
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to deal with it given it's no longer used in 
> theory.
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method
> caught my eye and I found that it does the following in
> client.register:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores the input parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
>  but the scaladoc says (note {{preferredNodeLocations}} param):
> {code}
>   /**
>* Registers the application master with the RM.
>*
>* @param conf The Yarn configuration.
>* @param sparkConf The Spark configuration.
>* @param preferredNodeLocations Map with hints about where to allocate 
> containers.
>* @param uiAddress Address of the SparkUI.
>* @param uiHistoryAddress Address of the application on the History Server.
>*/
> {code}






[jira] [Resolved] (SPARK-10920) another constructor for FPGrowth algorithm to support the absolute value for support

2015-10-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10920.
---
Resolution: Duplicate

> another constructor for FPGrowth algorithm to support the absolute value for 
> support
> 
>
> Key: SPARK-10920
> URL: https://issues.apache.org/jira/browse/SPARK-10920
> Project: Spark
>  Issue Type: Improvement
>Reporter: Tofigh
>Priority: Minor
>
> The current implementation takes the support value in percentage and then 
> counts the number of samples again to convert it back to absolute value. It 
> is better to have another constructor that directly takes the absolute value 
> for the support.






[jira] [Created] (SPARK-10922) Core SparkTestSuite fail

2015-10-04 Thread Jean-Baptiste Onofré (JIRA)
Jean-Baptiste Onofré created SPARK-10922:


 Summary: Core SparkTestSuite fail
 Key: SPARK-10922
 URL: https://issues.apache.org/jira/browse/SPARK-10922
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Jean-Baptiste Onofré


Core SparkTestSuite currently fails on master:

- includes jars passed in through --jars *** FAILED ***
- includes jars passed in through --packages *** FAILED ***
- sendWithReply: remotely *** FAILED ***

I'm taking a look (it looks good on Jenkins).






[jira] [Commented] (SPARK-10922) Core SparkTestSuite fail

2015-10-04 Thread Jean-Baptiste Onofré (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942704#comment-14942704
 ] 

Jean-Baptiste Onofré commented on SPARK-10922:
--

Ah, it looks related to the tmp filesystem being full on my system. Let me double-check.

> Core SparkTestSuite fail
> 
>
> Key: SPARK-10922
> URL: https://issues.apache.org/jira/browse/SPARK-10922
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Jean-Baptiste Onofré
>
> Core SparkTestSuite currently fails on master:
> - includes jars passed in through --jars *** FAILED ***
> - includes jars passed in through --packages *** FAILED ***
> - sendWithReply: remotely *** FAILED ***
> I'm taking a look (it looks good on Jenkins).






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-10-04 Thread Narisu Tao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942708#comment-14942708
 ] 

Narisu Tao commented on SPARK-4036:
---

What is the current status of the ticket? Is it still possible to participate 
in developing this feature?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Assigned] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10669:


Assignee: Apache Spark

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Commented] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942719#comment-14942719
 ] 

Apache Spark commented on SPARK-10669:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/8974

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Assigned] (SPARK-10669) Link to each language's API in codetabs in ML docs: spark.mllib

2015-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10669:


Assignee: (was: Apache Spark)

> Link to each language's API in codetabs in ML docs: spark.mllib
> ---
>
> Key: SPARK-10669
> URL: https://issues.apache.org/jira/browse/SPARK-10669
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> In the Markdown docs for the spark.mllib Programming Guide, we have code 
> examples with codetabs for each language.  We should link to each language's 
> API docs within the corresponding codetab, but we are inconsistent about 
> this.  For an example of what we want to do, see the "ChiSqSelector" section 
> in 
> [https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md]
> This JIRA is just for spark.mllib, not spark.ml






[jira] [Commented] (SPARK-10922) Core SparkTestSuite fail

2015-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942725#comment-14942725
 ] 

Sean Owen commented on SPARK-10922:
---

Some tests are simply flaky; it's a constant battle to improve them. You should 
try re-running first. Since Jenkins jobs are all passing now, I think I'd 
assume it's transient or your environment, yes.

> Core SparkTestSuite fail
> 
>
> Key: SPARK-10922
> URL: https://issues.apache.org/jira/browse/SPARK-10922
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Jean-Baptiste Onofré
>
> Core SparkTestSuite currently fails on master:
> - includes jars passed in through --jars *** FAILED ***
> - includes jars passed in through --packages *** FAILED ***
> - sendWithReply: remotely *** FAILED ***
> I'm taking a look (it looks good on Jenkins).






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942727#comment-14942727
 ] 

Sean Owen commented on SPARK-4036:
--

This is all there is to the discussion, so I don't believe this will proceed.

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-10922) Core SparkTestSuite fail

2015-10-04 Thread Jean-Baptiste Onofré (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942750#comment-14942750
 ] 

Jean-Baptiste Onofré commented on SPARK-10922:
--

Thanks Sean, I've relaunched a complete build. I will keep you posted. I'm also 
trying to improve the tests as I work on some JIRAs.

> Core SparkTestSuite fail
> 
>
> Key: SPARK-10922
> URL: https://issues.apache.org/jira/browse/SPARK-10922
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Jean-Baptiste Onofré
>
> Core SparkTestSuite currently fails on master:
> - includes jars passed in through --jars *** FAILED ***
> - includes jars passed in through --packages *** FAILED ***
> - sendWithReply: remotely *** FAILED ***
> I'm taking a look (it looks good on Jenkins).






[jira] [Commented] (SPARK-9478) Add class weights to Random Forest

2015-10-04 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942754#comment-14942754
 ] 

Meihua Wu commented on SPARK-9478:
--

[~pcrenshaw] Are you working on this? If not, I can send a PR based on 
[~josephkb]'s suggestions. 

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 






[jira] [Commented] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Daljeet Virdi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942807#comment-14942807
 ] 

Daljeet Virdi commented on SPARK-10921:
---

I can work on this. Can I be assigned?

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} 
> yet the code makes it less obvious as it says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other 
> cluster types (Mesos,
>   // etc) too. This is typically generated from 
> InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the 
> host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = 
> Map()
> {code}
> It turns out that there are places where the initialization does take place 
> that only adds up to the confusion.
> When you search for the use of {{SparkContext.preferredNodeLocationData}},
> you'll find 3 places - one constructor marked {{@deprecated}}, the other with
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and in
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to deal with it given it's no longer used in 
> theory.
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method
> caught my eye and I found that it does the following in
> client.register:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores the input parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
>  but the scaladoc says (note {{preferredNodeLocations}} param):
> {code}
>   /**
>* Registers the application master with the RM.
>*
>* @param conf The Yarn configuration.
>* @param sparkConf The Spark configuration.
>* @param preferredNodeLocations Map with hints about where to allocate 
> containers.
>* @param uiAddress Address of the SparkUI.
>* @param uiHistoryAddress Address of the application on the History Server.
>*/
> {code}






[jira] [Commented] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942813#comment-14942813
 ] 

Sean Owen commented on SPARK-10921:
---

Given [~jlaskowski] just opened this hours ago and mentioned wanting to change 
this, I'd assume he is working on it.

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} 
> yet the code makes it less obvious as it says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other 
> cluster types (Mesos,
>   // etc) too. This is typically generated from 
> InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the 
> host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = 
> Map()
> {code}
> It turns out that there are places where the initialization does take place 
> that only adds up to the confusion.
> When you search for the use of {{SparkContext.preferredNodeLocationData}},
> you'll find 3 places - one constructor marked {{@deprecated}}, the other with
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and in
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to deal with it given it's no longer used in 
> theory.
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method
> caught my eye and I found that it does the following in
> client.register:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores the input parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
>  but the scaladoc says (note {{preferredNodeLocations}} param):
> {code}
>   /**
>* Registers the application master with the RM.
>*
>* @param conf The Yarn configuration.
>* @param sparkConf The Spark configuration.
>* @param preferredNodeLocations Map with hints about where to allocate 
> containers.
>* @param uiAddress Address of the SparkUI.
>* @param uiHistoryAddress Address of the application on the History Server.
>*/
> {code}






[jira] [Commented] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Daljeet Virdi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942823#comment-14942823
 ] 

Daljeet Virdi commented on SPARK-10921:
---

Sorry, I'm new to this project. Thanks for clarifying.

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} 
> yet the code makes it less obvious as it says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other 
> cluster types (Mesos,
>   // etc) too. This is typically generated from 
> InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the 
> host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = 
> Map()
> {code}
> It turns out that there are places where the initialization does take place 
> that only adds up to the confusion.
> When you search for the use of {{SparkContext.preferredNodeLocationData}},
> you'll find 3 places - one constructor marked {{@deprecated}}, the other with
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and in
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to deal with it given it's no longer used in 
> theory.
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method
> caught my eye and I found that it does the following in
> client.register:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores the input parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
>  but the scaladoc says (note {{preferredNodeLocations}} param):
> {code}
>   /**
>* Registers the application master with the RM.
>*
>* @param conf The Yarn configuration.
>* @param sparkConf The Spark configuration.
>* @param preferredNodeLocations Map with hints about where to allocate 
> containers.
>* @param uiAddress Address of the SparkUI.
>* @param uiHistoryAddress Address of the application on the History Server.
>*/
> {code}






[jira] [Commented] (SPARK-10918) Task failed because executor kill by driver

2015-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942889#comment-14942889
 ] 

Apache Spark commented on SPARK-10918:
--

User 'shenh062326' has created a pull request for this issue:
https://github.com/apache/spark/pull/8975

> Task failed because executor kill by driver
> ---
>
> Key: SPARK-10918
> URL: https://issues.apache.org/jira/browse/SPARK-10918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Hong Shen
>
> When dynamic allocation is enabled and an executor reaches its idle timeout, it 
> is killed by the driver; if a task is offered to that executor at the same time, 
> the task fails because the executor is lost.
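
For context, the race is only reachable when dynamic allocation settings along 
these lines are in effect (the values here are illustrative, not from the report):

{noformat}
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.executorIdleTimeout  60s
{noformat}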






[jira] [Assigned] (SPARK-10918) Task failed because executor kill by driver

2015-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10918:


Assignee: (was: Apache Spark)

> Task failed because executor kill by driver
> ---
>
> Key: SPARK-10918
> URL: https://issues.apache.org/jira/browse/SPARK-10918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Hong Shen
>
> When dynamic allocation is enabled and an executor reaches its idle timeout, it 
> is killed by the driver; if a task is offered to that executor at the same time, 
> the task fails because the executor is lost.






[jira] [Assigned] (SPARK-10918) Task failed because executor kill by driver

2015-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10918:


Assignee: Apache Spark

> Task failed because executor kill by driver
> ---
>
> Key: SPARK-10918
> URL: https://issues.apache.org/jira/browse/SPARK-10918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Hong Shen
>Assignee: Apache Spark
>
> When dynamic allocation is enabled and an executor reaches its idle timeout, it 
> is killed by the driver; if a task is offered to that executor at the same time, 
> the task fails because the executor is lost.






[jira] [Commented] (SPARK-10896) Parquet join issue

2015-10-04 Thread Alex Rovner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942892#comment-14942892
 ] 

Alex Rovner commented on SPARK-10896:
-

I do not have an HDP cluster handy, so instead I am running on CDH 5.4 and I am 
unable to reproduce. I tried both with and without hadoop bundles and tried 
running both in local and YARN modes.

I always get the following as the last line of the output if I copy-paste your 
commands:
{noformat}
res3: Array[org.apache.spark.sql.Row] = Array([0,0,0,0], [1,1,1,1], [2,2,2,2], 
[3,3,3,3], [4,4,4,4], [5,5,5,5], [6,6,6,6], [7,7,7,7])
{noformat}

> Parquet join issue
> --
>
> Key: SPARK-10896
> URL: https://issues.apache.org/jira/browse/SPARK-10896
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: spark-1.5.0-bin-hadoop2.6.tgz with HDP 2.3
>Reporter: Tamas Szuromi
>  Labels: dataframe, hdfs, join, parquet, sql
>
> After loading Parquet files, the join does not work.
> How to reproduce:
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val arr1 = Array[Row](Row.apply(0, 0), Row.apply(1,1), Row.apply(2,2), 
> Row.apply(3, 3), Row.apply(4, 4), Row.apply(5, 5), Row.apply(6, 6), 
> Row.apply(7, 7))
> val schema1 = StructType(
>   StructField("id", IntegerType) ::
>   StructField("value1", IntegerType) :: Nil)
> val df1 = sqlContext.createDataFrame(sc.parallelize(arr1), schema1)
> val arr2 = Array[Row](Row.apply(0, 0), Row.apply(1,1), Row.apply(2,2), 
> Row.apply(3, 3), Row.apply(4, 4), Row.apply(5, 5), Row.apply(6, 6), 
> Row.apply(7, 7))
> val schema2 = StructType(
>   StructField("otherId", IntegerType) ::
>   StructField("value2", IntegerType) :: Nil)
> val df2 = sqlContext.createDataFrame(sc.parallelize(arr2), schema2)
> val res = df1.join(df2, df1("id")===df2("otherId"))
> df1.take(10)
> df2.take(10)
> res.count()
> res.take(10)
> df1.write.format("parquet").save("hdfs:///tmp/df1")
> df2.write.format("parquet").save("hdfs:///tmp/df2")
> val df1=sqlContext.read.parquet("hdfs:///tmp/df1/*.parquet")
> val df2=sqlContext.read.parquet("hdfs:///tmp/df2/*.parquet")
> val res = df1.join(df2, df1("id")===df2("otherId"))
> df1.take(10)
> df2.take(10)
> res.count()
> res.take(10)
> {code}
> Output
> {code:java}
> Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], 
> [5,5], [6,6], [7,7]) 
> Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], 
> [5,5], [6,6], [7,7]) 
> Long = 8 
> Array[org.apache.spark.sql.Row] = Array([0,0,0,0], [1,1,1,1], [2,2,2,2], 
> [3,3,3,3], [4,4,4,4], [5,5,5,5], [6,6,6,6], [7,7,7,7]) 
> {code}
> After reading back:
> {code:java}
> Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], 
> [5,5], [6,6], [7,7]) 
> Array[org.apache.spark.sql.Row] = Array([0,0], [1,1], [2,2], [3,3], [4,4], 
> [5,5], [6,6], [7,7]) 
> Long = 4 
> Array[org.apache.spark.sql.Row] = Array([0,0,0,5], [2,2,2,null], [4,4,4,5], 
> [6,6,6,null])
> {code}






[jira] [Resolved] (SPARK-10922) Core SparkTestSuite fail

2015-10-04 Thread Jean-Baptiste Onofré (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Baptiste Onofré resolved SPARK-10922.
--
Resolution: Not A Problem

Ran fine the second time.

> Core SparkTestSuite fail
> 
>
> Key: SPARK-10922
> URL: https://issues.apache.org/jira/browse/SPARK-10922
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: Jean-Baptiste Onofré
>
> Core SparkTestSuite currently fails on master:
> - includes jars passed in through --jars *** FAILED ***
> - includes jars passed in through --packages *** FAILED ***
> - sendWithReply: remotely *** FAILED ***
> I'm taking a look (it looks good on Jenkins).






[jira] [Commented] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942935#comment-14942935
 ] 

Apache Spark commented on SPARK-10921:
--

User 'jaceklaskowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/8976

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}}, 
> yet the code does not make this obvious; it still says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other cluster types (Mesos,
>   // etc) too. This is typically generated from InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
> {code}
> It turns out that there are still places where the field is initialized, which 
> only adds to the confusion.
> When you search for uses of {{SparkContext.preferredNodeLocationData}},
> you'll find three places: a constructor marked {{@deprecated}}, a
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and the
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to dealing with it, given that it is, at least 
> in theory, no longer used.
> The 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method caught my eye: it passes the following to {{client.register}}:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores that parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
> even though its scaladoc documents it (note the {{preferredNodeLocations}} param):
> {code}
>   /**
>    * Registers the application master with the RM.
>    *
>    * @param conf The Yarn configuration.
>    * @param sparkConf The Spark configuration.
>    * @param preferredNodeLocations Map with hints about where to allocate containers.
>    * @param uiAddress Address of the SparkUI.
>    * @param uiHistoryAddress Address of the application on the History Server.
>    */
> {code}
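
For discussion, a rough sketch of one possible end state, with purely hypothetical names and parameter types inferred from the quoted scaladoc (this is not necessarily what the linked pull request does): {{SparkContext}} loses the field entirely, and the register call no longer takes a locality-hints argument.

{code:java}
// Hypothetical sketch only; the real YarnRMClient/ApplicationMaster signatures may differ.
import org.apache.hadoop.yarn.conf.YarnConfiguration
import org.apache.spark.SparkConf

class RMClientSketch {
  /** Registers the application master with the RM; no preferredNodeLocations parameter. */
  def register(
      conf: YarnConfiguration,
      sparkConf: SparkConf,
      uiAddress: String,
      uiHistoryAddress: String): Unit = {
    // ... actual RM registration omitted ...
  }
}

object RegisterAMSketch {
  // The caller no longer consults SparkContext.preferredNodeLocationData at all.
  def registerAM(
      client: RMClientSketch,
      yarnConf: YarnConfiguration,
      sparkConf: SparkConf,
      uiAddress: String,
      historyAddress: String): Unit = {
    client.register(yarnConf, sparkConf, uiAddress, historyAddress)
  }
}
{code}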



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10921:


Assignee: (was: Apache Spark)

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}}, 
> yet the code does not make this obvious; it still says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other cluster types (Mesos,
>   // etc) too. This is typically generated from InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
> {code}
> It turns out that there are still places where the field is initialized, which 
> only adds to the confusion.
> When you search for uses of {{SparkContext.preferredNodeLocationData}},
> you'll find three places: a constructor marked {{@deprecated}}, a
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and the
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to dealing with it, given that it is, at least 
> in theory, no longer used.
> The 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method caught my eye: it passes the following to {{client.register}}:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores that parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
> even though its scaladoc documents it (note the {{preferredNodeLocations}} param):
> {code}
>   /**
>    * Registers the application master with the RM.
>    *
>    * @param conf The Yarn configuration.
>    * @param sparkConf The Spark configuration.
>    * @param preferredNodeLocations Map with hints about where to allocate containers.
>    * @param uiAddress Address of the SparkUI.
>    * @param uiHistoryAddress Address of the application on the History Server.
>    */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData

2015-10-04 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10921:


Assignee: Apache Spark

> Completely remove the use of SparkContext.preferredNodeLocationData
> ---
>
> Key: SPARK-10921
> URL: https://issues.apache.org/jira/browse/SPARK-10921
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.5.1
>Reporter: Jacek Laskowski
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}}, 
> yet the code does not make this obvious; it still says (see 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96):
> {code}
>   // This is used only by YARN for now, but should be relevant to other cluster types (Mesos,
>   // etc) too. This is typically generated from InputFormatInfo.computePreferredLocations. It
>   // contains a map from hostname to a list of input format splits on the host.
>   private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()
> {code}
> It turns out that there are still places where the field is initialized, which 
> only adds to the confusion.
> When you search for uses of {{SparkContext.preferredNodeLocationData}},
> you'll find three places: a constructor marked {{@deprecated}}, a
> {{logWarning}} telling us that _"Passing in preferred locations has no
> effect at all, see SPARK-8949"_, and the
> {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method.
> There is no consistent approach to dealing with it, given that it is, at least 
> in theory, no longer used.
> The 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265]
>  method caught my eye: it passes the following to {{client.register}}:
> {code}
> if (sc != null) sc.preferredNodeLocationData else Map()
> {code}
> However, {{client.register}} [ignores that parameter 
> completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78],
> even though its scaladoc documents it (note the {{preferredNodeLocations}} param):
> {code}
>   /**
>    * Registers the application master with the RM.
>    *
>    * @param conf The Yarn configuration.
>    * @param sparkConf The Spark configuration.
>    * @param preferredNodeLocations Map with hints about where to allocate containers.
>    * @param uiAddress Address of the SparkUI.
>    * @param uiHistoryAddress Address of the application on the History Server.
>    */
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-10-04 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942971#comment-14942971
 ] 

Meihua Wu commented on SPARK-7129:
--

Currently I am not aware of a straightforward way to impose the weak-learner 
restriction using the type system. Let's keep discussing.
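
For the sake of discussion, here is a toy, framework-free sketch (all names are hypothetical; none of this is the spark.ml API) of one way the type system could express such a restriction: bound the base-learner type parameter by a marker trait for learners that honour per-instance weights, which AdaBoost-style reweighting needs.

{code:java}
// Toy sketch only: a generic booster whose type bound rejects base learners
// that cannot handle instance weights. None of these types exist in spark.ml.
trait Learner[M] {
  def fit(data: Seq[(Double, Array[Double])], weights: Seq[Double]): M
}

// Marker trait: learners that honour per-instance weights.
trait WeightAware

class Boosting[M, L <: Learner[M] with WeightAware](base: L, numIterations: Int) {
  def fit(data: Seq[(Double, Array[Double])]): Seq[M] = {
    val weights = Array.fill(data.size)(1.0 / data.size)
    (1 to numIterations).map { _ =>
      val model = base.fit(data, weights)
      // ... reweight misclassified examples here (omitted) ...
      model
    }
  }
}
{code}

Whether a bound like this is expressive enough for the variants listed below (multiple loss functions, multiclass, multilabel) is exactly the open question.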

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org