[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604437#comment-15604437
 ] 

Hyukjin Kwon commented on SPARK-18068:
--

Oh, it is only documented in the code - 
https://github.com/apache/spark/blob/b3130c7b6a1ab4975023f08c3ab02ee8d2c7e995/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L287-L289



> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)

[jira] [Comment Edited] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604377#comment-15604377
 ] 

Hyukjin Kwon edited comment on SPARK-18068 at 10/25/16 7:01 AM:


[~stephane.maa...@gmail.com]  As a workaround, we might be able to do this as 
below:

{code}
val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
val schema = StructType(StructField("tst",TimestampType,true):: Nil)
val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mmZZ")
  .json(jsonRDD)
df.show()

+--------------------+
|                 tst|
+--------------------+
|2016-10-07 20:15:...|
+--------------------+
{code}

FYI, actually the exception came from..

{code}
DatatypeConverter.parseDateTime(s).getTime()
{code}
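
For reference, the same failure reproduces outside Spark, since javax.xml.bind.DatatypeConverter parses the xsd:dateTime lexical form, which does not allow omitting the seconds field (a minimal sketch):

{code}
import javax.xml.bind.DatatypeConverter

// OK: seconds are present, as xsd:dateTime requires.
DatatypeConverter.parseDateTime("2016-10-07T11:15:00Z").getTime()

// Throws java.lang.IllegalArgumentException: 2016-10-07T11:15Z
// (matching the stack trace above) because the seconds field is missing.
DatatypeConverter.parseDateTime("2016-10-07T11:15Z").getTime()
{code}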

EDITED: Oh, I missed that Sean also said the same thing.



was (Author: hyukjin.kwon):
[~stephane.maa...@gmail.com]  As a workaround, we might be able to do this as 
below:

{code}
val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
val schema = StructType(StructField("tst",TimestampType,true):: Nil)
val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mmZZ")
  .json(jsonRDD)
df.show()

+--------------------+
|                 tst|
+--------------------+
|2016-10-07 20:15:...|
+--------------------+
{code}

FYI, actually the exception came from..

{code}
DatatypeConverter.parseDateTime(s).getTime()
{code}


> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> 

[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604443#comment-15604443
 ] 

Stephane Maarek commented on SPARK-18068:
-

Would be awesome to expose it in the online docs!




> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85

[jira] [Resolved] (SPARK-18026) should not always lowercase partition columns of partition spec in parser

2016-10-25 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18026.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15566
[https://github.com/apache/spark/pull/15566]

> should not always lowercase partition columns of partition spec in parser
> -
>
> Key: SPARK-18026
> URL: https://issues.apache.org/jira/browse/SPARK-18026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2016-10-25 Thread Kapil Singh (JIRA)
Kapil Singh created SPARK-18091:
---

 Summary: Deep if expressions cause Generated 
SpecificUnsafeProjection code to exceed JVM code size limit
 Key: SPARK-18091
 URL: https://issues.apache.org/jira/browse/SPARK-18091
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Kapil Singh
Priority: Critical


*Problem Description:*
I have an application in which a lot of if-else decision logic is involved in 
generating the output. I'm getting the following exception:
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)

*Steps to Reproduce:*
I've come up with a unit test which I was able to run in 
CodeGenerationSuite.scala:
{code}
test("split large if expressions into blocks due to JVM code size limit") {
  val row = create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
  val inputStr = 'a.string.at(0)
  val inputIdx = 'a.int.at(1)

  val length = 10
  val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + i)

  val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), inputIdx), valuesToCompareTo(0))
  var res: Expression = If(initCondition, Literal("category1"), Literal("NULL"))
  var cummulativeCondition: Expression = Not(initCondition)
  for (index <- 1 to length) {
    val valueExtractedFromInput = RegExpExtract(inputStr, Literal("category" + (index + 1).toString), inputIdx)
    val currComparee = If(cummulativeCondition, valueExtractedFromInput, Literal("NULL"))
    val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
    val combinedCond = And(cummulativeCondition, currCondition)
    res = If(combinedCond, If(combinedCond, valueExtractedFromInput, Literal("NULL")), res)
    cummulativeCondition = And(Not(currCondition), cummulativeCondition)
  }

  val expressions = Seq(res)
  val plan = GenerateUnsafeProjection.generate(expressions, true)
  val actual = plan(row).toSeq(expressions.map(_.dataType))
  val expected = Seq(UTF8String.fromString("category2"))

  if (!checkResult(actual, expected)) {
    fail(s"Incorrect Evaluation: expressions: $expressions, actual: $actual, expected: $expected")
  }
}
{code}

*Root Cause:*
The current splitting of projection code doesn't (and can't) split the generated 
code for an individual output column expression, so that code can grow beyond the 
JVM's 64 KB method size limit.

*Proposed Fix:*
The If expression should place the generated code for its predicate, true-value, 
and false-value expressions in separate methods in the codegen context and call 
those methods, instead of inlining all of that code directly into its own 
generated code. See the sketch below.
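
Purely as an illustration (this is not Spark's actual codegen API; the names and signatures below are assumptions), the idea is to register each branch's code as its own helper method on the generated class and have the If expression emit only calls to those helpers:

{code}
// Hypothetical sketch of the proposed splitting, not Spark's real CodegenContext.
import scala.collection.mutable.ArrayBuffer

class SketchCodegenContext {
  private val functions = ArrayBuffer.empty[String]

  // Register a helper method on the generated class and return its name.
  def addNewFunction(name: String, returnType: String, body: String): String = {
    functions += s"private $returnType $name(InternalRow i) {\n$body\n}"
    name
  }

  // All registered helpers, to be spliced into the generated class body.
  def declaredFunctions: String = functions.mkString("\n\n")
}

// Instead of inlining predicate/true/false code into one huge method,
// the If expression emits three helpers and a small call site.
def generateIfCode(ctx: SketchCodegenContext,
                   predicateBody: String,
                   trueBody: String,
                   falseBody: String): String = {
  val p = ctx.addNewFunction("ifPredicate", "boolean", predicateBody)
  val t = ctx.addNewFunction("ifTrueValue", "UTF8String", trueBody)
  val f = ctx.addNewFunction("ifFalseValue", "UTF8String", falseBody)
  s"UTF8String value = $p(i) ? $t(i) : $f(i);"
}
{code}

Because each helper is a separate JVM method, no single generated method has to absorb the full depth of nested If expressions.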



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2016-10-25 Thread Kapil Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Singh updated SPARK-18091:

Description: 
*Problem Description:*
I have an application in which a lot of if-else decision logic is involved in 
generating the output. I'm getting the following exception:
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)

*Steps to Reproduce:*
I've come up with a unit test which I was able to run in 
CodeGenerationSuite.scala:
{code}
test("split large if expressions into blocks due to JVM code size limit") {
  val row = create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
  val inputStr = 'a.string.at(0)
  val inputIdx = 'a.int.at(1)

  val length = 10
  val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + i)

  val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), inputIdx), valuesToCompareTo(0))
  var res: Expression = If(initCondition, Literal("category1"), Literal("NULL"))
  var cummulativeCondition: Expression = Not(initCondition)
  for (index <- 1 to length) {
    val valueExtractedFromInput = RegExpExtract(inputStr, Literal("category" + (index + 1).toString), inputIdx)
    val currComparee = If(cummulativeCondition, valueExtractedFromInput, Literal("NULL"))
    val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
    val combinedCond = And(cummulativeCondition, currCondition)
    res = If(combinedCond, If(combinedCond, valueExtractedFromInput, Literal("NULL")), res)
    cummulativeCondition = And(Not(currCondition), cummulativeCondition)
  }

  val expressions = Seq(res)
  val plan = GenerateUnsafeProjection.generate(expressions, true)
  val actual = plan(row).toSeq(expressions.map(_.dataType))
  val expected = Seq(UTF8String.fromString("category2"))

  if (!checkResult(actual, expected)) {
    fail(s"Incorrect Evaluation: expressions: $expressions, actual: $actual, expected: $expected")
  }
}
{code}

*Root Cause:*
The current splitting of projection code doesn't (and can't) split the generated 
code for an individual output column expression, so that code can grow beyond the 
JVM's 64 KB method size limit.

*Note:* This issue seems related to SPARK-14887, but I'm not sure whether the 
root cause is the same.
 
*Proposed Fix:*
The If expression should place the generated code for its predicate, true-value, 
and false-value expressions in separate methods in the codegen context and call 
those methods, instead of inlining all of that code directly into its own 
generated code.

  was:
*Problem Description:*
I have an application in which a lot of if-else decisioning is involved to 
generate output. I'm getting following exception:
Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
"(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
 of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
 grows beyond 64 KB
at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)

*Steps to Reproduce:*
I've come up with a unit test which I was able to run in 
CodeGenerationSuite.scala:
{code}
test("split large if expressions into blocks due to JVM code size limit") {
val row = 
create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
val inputStr = 'a.string.at(0)
val inputIdx = 'a.int.at(1)

val length = 10
val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + i)

val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), 
inputIdx), valuesToCompareTo(0))
var res: Expression = If(initCondition, Literal("category1"), 
Literal("NULL"))
var cummulativeCondition: Expression = Not(initCondition)
for (index <- 1 to length) {
  val valueExtractedFromInput = RegExpExtract(inputStr, Literal("category" 
+ (index + 1).toString), inputIdx)
  val currComparee = If(cummulativeCondition, valueExtractedFromInput, 
Literal("NULL"))
  val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
  val combinedCond = And(cummulativeCondi

[jira] [Commented] (SPARK-13331) AES support for over-the-wire encryption

2016-10-25 Thread Junjie Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604582#comment-15604582
 ] 

Junjie Chen commented on SPARK-13331:
-

Hi [~vanzin]

The updated patch has been committed according to your comments.

I tried to change the negotiation so that the client sends its configuration to 
the server without waiting for a response, but the server end does not receive 
the data as expected. According to the description of TransportClient.send, 
delivery is not guaranteed.

I think the client should wait for a response from the server to complete a 
handshake here; otherwise the client may start sending encrypted data while the 
server is not yet ready to accept it. Isn't that right?

> AES support for over-the-wire encryption
> 
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for 
> negotiating a secure communication channel. When the SASL operation mode is 
> "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 
> mechanism supports the following encryption algorithms: 3DES, DES, and RC4. 
> The negotiation procedure selects one of them to encrypt / decrypt the data 
> on the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the 
> negotiation to support AES for better security and performance.
> The proposed solution is:
> When "auth-conf" is enabled, at the end of original negotiation, the 
> authentication succeeds and a secure channel is built. We could add one more 
> negotiation step: Client and server negotiate whether they both support AES. 
> If yes, the Key and IV used by AES will be generated by server and sent to 
> client through the already secure channel. Then update the encryption / 
> decryption handler to AES at both client and server side. Following data 
> transfer will use AES instead of original encryption algorithm.
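
A rough sketch of the key/IV generation and cipher switch the proposal describes, using plain javax.crypto rather than Spark's network module (the key size, cipher mode, and variable names are illustrative assumptions):

{code}
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Server side: generate the AES key and IV that would be sent to the client
// over the already-authenticated SASL "auth-conf" channel.
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)
val key = keyGen.generateKey()
val iv = new Array[Byte](16)
new SecureRandom().nextBytes(iv)

// Both sides: once key and IV are shared, swap the channel's encryption /
// decryption handlers from the DIGEST-MD5 ciphers (3DES/DES/RC4) to AES.
val encrypt = Cipher.getInstance("AES/CTR/NoPadding")
encrypt.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key.getEncoded, "AES"), new IvParameterSpec(iv))
val decrypt = Cipher.getInstance("AES/CTR/NoPadding")
decrypt.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key.getEncoded, "AES"), new IvParameterSpec(iv))

// Round-trip a payload to show both handlers agree on the key and IV.
val roundTrip = new String(decrypt.doFinal(encrypt.doFinal("block data".getBytes("UTF-8"))), "UTF-8")
{code}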



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2016-10-25 Thread Kapil Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604587#comment-15604587
 ] 

Kapil Singh commented on SPARK-18091:
-

I've started working on this

> Deep if expressions cause Generated SpecificUnsafeProjection code to exceed 
> JVM code size limit
> ---
>
> Key: SPARK-18091
> URL: https://issues.apache.org/jira/browse/SPARK-18091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kapil Singh
>Priority: Critical
>
> *Problem Description:*
> I have an application in which a lot of if-else decisioning is involved to 
> generate output. I'm getting following exception:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)
> *Steps to Reproduce:*
> I've come up with a unit test which I was able to run in 
> CodeGenerationSuite.scala:
> {code}
> test("split large if expressions into blocks due to JVM code size limit") {
> val row = 
> create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
> val inputStr = 'a.string.at(0)
> val inputIdx = 'a.int.at(1)
> val length = 10
> val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + 
> i)
> val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), 
> inputIdx), valuesToCompareTo(0))
> var res: Expression = If(initCondition, Literal("category1"), 
> Literal("NULL"))
> var cummulativeCondition: Expression = Not(initCondition)
> for (index <- 1 to length) {
>   val valueExtractedFromInput = RegExpExtract(inputStr, 
> Literal("category" + (index + 1).toString), inputIdx)
>   val currComparee = If(cummulativeCondition, valueExtractedFromInput, 
> Literal("NULL"))
>   val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
>   val combinedCond = And(cummulativeCondition, currCondition)
>   res = If(combinedCond, If(combinedCond, valueExtractedFromInput, 
> Literal("NULL")), res)
>   cummulativeCondition = And(Not(currCondition), cummulativeCondition)
> }
> val expressions = Seq(res)
> val plan = GenerateUnsafeProjection.generate(expressions, true)
> val actual = plan(row).toSeq(expressions.map(_.dataType))
> val expected = Seq(UTF8String.fromString("category2"))
> if (!checkResult(actual, expected)) {
>   fail(s"Incorrect Evaluation: expressions: $expressions, actual: 
> $actual, expected: $expected")
> }
>   }
> {code}
> *Root Cause:*
> Current splitting of Projection codes doesn't (and can't) take care of 
> splitting the generated code for individual output column expressions. So it 
> can grow to exceed JVM limit.
> *Note:* This issue seems related to SPARK-14887 but I'm not sure whether the 
> root cause is same
>  
> *Proposed Fix:*
> If expression should place it's predicate, true value and false value 
> expressions' generated code in separate methods in context and call these 
> methods instead of putting the whole code directly in its generated code



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14745) CEP support in Spark Streaming

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14745.
---
Resolution: Won't Fix

> CEP support in Spark Streaming
> --
>
> Key: SPARK-14745
> URL: https://issues.apache.org/jira/browse/SPARK-14745
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Mario Briggs
> Attachments: SparkStreamingCEP.pdf
>
>
> Complex Event Processing is an often-used feature in streaming applications. 
> Spark Streaming currently does not have a DSL/API for it. This JIRA is about 
> how/what we can add in Spark Streaming to support CEP out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14634) Add BisectingKMeansSummary

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604652#comment-15604652
 ] 

Apache Spark commented on SPARK-14634:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15619

> Add BisectingKMeansSummary
> --
>
> Key: SPARK-14634
> URL: https://issues.apache.org/jira/browse/SPARK-14634
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.1.0
>
>
> Add BisectingKMeansSummary



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14230) Config the start time (jitter) for streaming jobs

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14230.
---
Resolution: Won't Fix

> Config the start time (jitter) for streaming jobs
> -
>
> Key: SPARK-14230
> URL: https://issues.apache.org/jira/browse/SPARK-14230
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Liyin Tang
>
> Currently, RecurringTimer normalizes the start time. For instance, if the 
> batch duration is 1 min, all jobs will start exactly at the 1 min boundary. 
> This adds some burden to the streaming source. Assuming the source is Kafka 
> and there is a list of streaming jobs with a 1 min batch duration, high 
> network traffic will be observed in Kafka during the first few seconds of 
> each minute. This makes Kafka capacity planning tricky. 
> It would be great to have an option in the streaming context to set the job 
> start time. In this way, a user can add a jitter to the start time of each 
> job and make the Kafka fetch requests much smoother across the duration window.
> {code}
> class RecurringTimer {
>   def getStartTime(): Long = {
> (math.floor(clock.currentTime.toDouble / period) + 1).toLong * period + 
> jitter
>   }
> }
> {code}
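
For example (a hypothetical sketch of the arithmetic above; the 60 s batch duration and 5 s jitter are assumed values, not existing configuration):

{code}
// Illustration of getStartTime() with and without a jitter offset.
val period = 60000L                       // 1-minute batch duration, in ms
val jitter = 5000L                        // per-job offset chosen by the user, in ms
val now    = System.currentTimeMillis()

val alignedStart  = (math.floor(now.toDouble / period) + 1).toLong * period
val jitteredStart = alignedStart + jitter
// Without jitter, every job fires exactly on the minute boundary;
// with jitter, this job fires 5 s later, spreading out Kafka fetch requests.
{code}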



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15429) When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate the receiving rate accurately.

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15429.
---
Resolution: Won't Fix

> When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate 
> the receiving rate accurately.
> --
>
> Key: SPARK-15429
> URL: https://issues.apache.org/jira/browse/SPARK-15429
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: Albert Cheng
>
> When `spark.streaming.concurrentJobs > 1`, PIDRateEstimator cannot estimate 
> the receiving rate accurately.
> For example, if the batch duration is set to 10 seconds, each rdd in the 
> dstream will take 20s to compute. By changing 
> `spark.streaming.concurrentJobs=2`, each rdd in the dstream still takes 20s 
> to consume the data, which leads to poor estimation of backpressure by 
> PIDRateEstimator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18039) ReceiverTracker runs the dummy job too fast, causing unbalanced receiver scheduling

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-18039.
---
Resolution: Won't Fix

> ReceiverTracker runs the dummy job too fast, causing unbalanced receiver scheduling
> -
>
> Key: SPARK-18039
> URL: https://issues.apache.org/jira/browse/SPARK-18039
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
>Reporter: astralidea
>Priority: Minor
>
> Receiver scheduling balance is important for me. 
> For instance, if I have 2 executors and each executor has 1 receiver, the 
> calculation time is 0.1 s per batch. 
> But if I have 2 executors and one executor has 2 receivers while the other 
> has 0, the calculation time increases by 3 s per batch. 
> In my cluster, executor initialization is slow and I need to wait about 30 s, 
> but the dummy job only waits 4 s. I added the conf 
> spark.scheduler.maxRegisteredResourcesWaitingTime, but it does not work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604777#comment-15604777
 ] 

Sean Owen commented on SPARK-18065:
---

It might be a little late, but if you find a good spot in the release notes or 
docs to note this, go ahead.

> Spark 2 allows filter/where on columns not in current schema
> 
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Matthew Scruggs
>Priority: Minor
>
> I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
> DataFrame that previously had a column, but no longer has it in its schema 
> due to a select() operation.
> In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
> attempting to filter/where using the selected-out column:
> {code:title=Spark 1.6.2}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>       /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), 
> (2, "two")))).selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---+----+
> | id|word|
> +---+----+
> |  1| one|
> |  2| two|
> +---+----+
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
> columns: [id];
> {code}
> However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds 
> (no AnalysisException) and seems to filter out data as if the column remains:
> {code:title=Spark 2.0.1}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>       /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
> "two"))).toDF().selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---+----+
> | id|word|
> +---+----+
> |  1| one|
> |  2| two|
> +---+----+
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> +---+
> | id|
> +---+
> |  1|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17987.
---
Resolution: Not A Problem

> ML Evaluator fails to handle null values in the dataset
> ---
>
> Key: SPARK-17987
> URL: https://issues.apache.org/jira/browse/SPARK-17987
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.2, 2.0.1
>Reporter: bo song
>
> Take RegressionEvaluator as an example: when the predictionCol is null in 
> a row, a scala.MatchError exception is thrown. A missing (null) prediction 
> is a common case; for example, when a predictor is missing or its value is 
> out of bounds, most machine learning models cannot produce a correct 
> prediction and return null instead. Evaluators should handle null values 
> instead of throwing an exception; the common way to handle missing values 
> is to ignore them. Besides null values, NaN values need to be handled 
> correctly too. 
> The three evaluators RegressionEvaluator, BinaryClassificationEvaluator and 
> MulticlassClassificationEvaluator all have this problem.
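
As a caller-side workaround today, null/NaN predictions can be dropped before evaluation. A sketch, assuming an existing predictions DataFrame with the default "label" and "prediction" columns:

{code}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, isnan}

// `predictions` is assumed to already contain "label" and "prediction" columns.
def rmseIgnoringMissing(predictions: DataFrame): Double = {
  val cleaned = predictions
    .na.drop(Seq("label", "prediction"))   // drop rows with null label/prediction
    .filter(!isnan(col("prediction")))     // drop rows with NaN predictions

  new RegressionEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("rmse")
    .evaluate(cleaned)
}
{code}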



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10319) ALS training using PySpark throws a StackOverflowError

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10319.
---
Resolution: Cannot Reproduce

> ALS training using PySpark throws a StackOverflowError
> --
>
> Key: SPARK-10319
> URL: https://issues.apache.org/jira/browse/SPARK-10319
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
> Environment: Windows 10, spark - 1.4.1,
>Reporter: Velu nambi
>
> When attempting to train a machine learning model using ALS in Spark's MLlib 
> (1.4) on Windows, PySpark always terminates with a StackOverflowError. I 
> tried adding checkpointing as described in 
> http://stackoverflow.com/a/31484461/36130 -- it doesn't seem to help.
> Here's the training code and stack trace:
> {code:none}
> ranks = [8, 12]
> lambdas = [0.1, 10.0]
> numIters = [10, 20]
> bestModel = None
> bestValidationRmse = float("inf")
> bestRank = 0
> bestLambda = -1.0
> bestNumIter = -1
> for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
> ALS.checkpointInterval = 2
> model = ALS.train(training, rank, numIter, lmbda)
> validationRmse = computeRmse(model, validation, numValidation)
> if (validationRmse < bestValidationRmse):
>  bestModel = model
>  bestValidationRmse = validationRmse
>  bestRank = rank
>  bestLambda = lmbda
>  bestNumIter = numIter
> testRmse = computeRmse(bestModel, test, numTest)
> {code}
> Stacktrace:
> 15/08/27 02:02:58 ERROR Executor: Exception in task 3.0 in stage 56.0 (TID 
> 127)
> java.lang.StackOverflowError
> at java.io.ObjectInputStream$BlockDataInputStream.readInt(Unknown Source)
> at java.io.ObjectInputStream.readHandle(Unknown Source)
> at java.io.ObjectInputStream.readClassDesc(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)
> at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
> at java.io.ObjectInputStream.readObject0(Unknown Source)
> at java.io.ObjectInputStream.readObject(Unknown Source)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at java.io.ObjectStreamClass.invokeReadObject(Unknown Source)
> at java.io.ObjectInputStream.readSerialData(Unknown Source)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604794#comment-15604794
 ] 

Sean Owen commented on SPARK-18068:
---

It's documented in the API docs, as in: 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#json(java.lang.String)


> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask

[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604805#comment-15604805
 ] 

Apache Spark commented on SPARK-17748:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15621

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.1.0
>
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elastic-net penalties by solving the equations locally with the OWL-QN 
> solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can therefore also add support for singular covariance matrices by adding 
> L-BFGS.
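
From the user-facing side, this would let the local normal-equation solver accept elastic-net settings that currently require the iterative path. A sketch only; `training` is assumed to be an existing DataFrame, and the column names and parameter values are illustrative:

{code}
import org.apache.spark.ml.regression.LinearRegression

// `training` is assumed to have "features" and "label" columns.
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setRegParam(0.1)
  .setElasticNetParam(0.5)   // mixes L1 and L2 penalties
  .setSolver("normal")       // solve the normal equations locally on the driver

val model = lr.fit(training)
println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")
{code}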



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18091:


Assignee: Apache Spark

> Deep if expressions cause Generated SpecificUnsafeProjection code to exceed 
> JVM code size limit
> ---
>
> Key: SPARK-18091
> URL: https://issues.apache.org/jira/browse/SPARK-18091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kapil Singh
>Assignee: Apache Spark
>Priority: Critical
>
> *Problem Description:*
> I have an application in which a lot of if-else decisioning is involved to 
> generate output. I'm getting following exception:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)
> *Steps to Reproduce:*
> I've come up with a unit test which I was able to run in 
> CodeGenerationSuite.scala:
> {code}
> test("split large if expressions into blocks due to JVM code size limit") {
> val row = 
> create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
> val inputStr = 'a.string.at(0)
> val inputIdx = 'a.int.at(1)
> val length = 10
> val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + 
> i)
> val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), 
> inputIdx), valuesToCompareTo(0))
> var res: Expression = If(initCondition, Literal("category1"), 
> Literal("NULL"))
> var cummulativeCondition: Expression = Not(initCondition)
> for (index <- 1 to length) {
>   val valueExtractedFromInput = RegExpExtract(inputStr, 
> Literal("category" + (index + 1).toString), inputIdx)
>   val currComparee = If(cummulativeCondition, valueExtractedFromInput, 
> Literal("NULL"))
>   val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
>   val combinedCond = And(cummulativeCondition, currCondition)
>   res = If(combinedCond, If(combinedCond, valueExtractedFromInput, 
> Literal("NULL")), res)
>   cummulativeCondition = And(Not(currCondition), cummulativeCondition)
> }
> val expressions = Seq(res)
> val plan = GenerateUnsafeProjection.generate(expressions, true)
> val actual = plan(row).toSeq(expressions.map(_.dataType))
> val expected = Seq(UTF8String.fromString("category2"))
> if (!checkResult(actual, expected)) {
>   fail(s"Incorrect Evaluation: expressions: $expressions, actual: 
> $actual, expected: $expected")
> }
>   }
> {code}
> *Root Cause:*
> Current splitting of Projection codes doesn't (and can't) take care of 
> splitting the generated code for individual output column expressions. So it 
> can grow to exceed JVM limit.
> *Note:* This issue seems related to SPARK-14887 but I'm not sure whether the 
> root cause is same
>  
> *Proposed Fix:*
> If expression should place it's predicate, true value and false value 
> expressions' generated code in separate methods in context and call these 
> methods instead of putting the whole code directly in its generated code



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18091:


Assignee: (was: Apache Spark)

> Deep if expressions cause Generated SpecificUnsafeProjection code to exceed 
> JVM code size limit
> ---
>
> Key: SPARK-18091
> URL: https://issues.apache.org/jira/browse/SPARK-18091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kapil Singh
>Priority: Critical
>
> *Problem Description:*
> I have an application in which a lot of if-else decisioning is involved to 
> generate output. I'm getting following exception:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)
> *Steps to Reproduce:*
> I've come up with a unit test which I was able to run in 
> CodeGenerationSuite.scala:
> {code}
> test("split large if expressions into blocks due to JVM code size limit") {
> val row = 
> create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
> val inputStr = 'a.string.at(0)
> val inputIdx = 'a.int.at(1)
> val length = 10
> val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + 
> i)
> val initCondition = EqualTo(RegExpExtract(inputStr, Literal("category1"), 
> inputIdx), valuesToCompareTo(0))
> var res: Expression = If(initCondition, Literal("category1"), 
> Literal("NULL"))
> var cummulativeCondition: Expression = Not(initCondition)
> for (index <- 1 to length) {
>   val valueExtractedFromInput = RegExpExtract(inputStr, 
> Literal("category" + (index + 1).toString), inputIdx)
>   val currComparee = If(cummulativeCondition, valueExtractedFromInput, 
> Literal("NULL"))
>   val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
>   val combinedCond = And(cummulativeCondition, currCondition)
>   res = If(combinedCond, If(combinedCond, valueExtractedFromInput, 
> Literal("NULL")), res)
>   cummulativeCondition = And(Not(currCondition), cummulativeCondition)
> }
> val expressions = Seq(res)
> val plan = GenerateUnsafeProjection.generate(expressions, true)
> val actual = plan(row).toSeq(expressions.map(_.dataType))
> val expected = Seq(UTF8String.fromString("category2"))
> if (!checkResult(actual, expected)) {
>   fail(s"Incorrect Evaluation: expressions: $expressions, actual: 
> $actual, expected: $expected")
> }
>   }
> {code}
> *Root Cause:*
> The current splitting of projection code doesn't (and can't) split the generated 
> code of an individual output column expression, so that code can grow beyond the 
> JVM's 64 KB method size limit.
> *Note:* This issue seems related to SPARK-14887, but I'm not sure whether the 
> root cause is the same.
>  
> *Proposed Fix:*
> The If expression should place the generated code for its predicate, true-value 
> and false-value expressions in separate methods registered on the codegen context, 
> and call those methods instead of inlining all of that code in its own generated code.
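
For illustration, a minimal, self-contained sketch of the splitting idea in plain Scala 
(not Spark's actual codegen API; the names are made up): emit each predicate / true-value / 
false-value as its own small method and keep the top-level method as a chain of cheap calls, 
so no single method's bytecode approaches the JVM's 64 KB limit.

{code}
object SplitIfCodegenSketch {
  // One small helper per branch keeps every method well under the 64 KB limit.
  private def cond1(s: String): Boolean = s.contains("category1")
  private def value1(s: String): String = "category1"
  private def cond2(s: String): Boolean = s.contains("category2")
  private def value2(s: String): String = "category2"

  // The "outer" method only dispatches to the helpers instead of inlining them.
  def eval(s: String): String =
    if (cond1(s)) value1(s)
    else if (cond2(s)) value2(s)
    else "NULL"

  def main(args: Array[String]): Unit =
    println(eval("afafFAFFsqcategory2dadDADcategory8sasasadscategory24")) // category2
}
{code}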



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18091) Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604806#comment-15604806
 ] 

Apache Spark commented on SPARK-18091:
--

User 'kapilsingh5050' has created a pull request for this issue:
https://github.com/apache/spark/pull/15620

> Deep if expressions cause Generated SpecificUnsafeProjection code to exceed 
> JVM code size limit
> ---
>
> Key: SPARK-18091
> URL: https://issues.apache.org/jira/browse/SPARK-18091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Kapil Singh
>Priority: Critical
>
> *Problem Description:*
> I have an application that involves a lot of if-else branching to generate its 
> output. I'm getting the following exception:
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:874)
>   at org.codehaus.janino.CodeContext.writeBranch(CodeContext.java:965)
>   at org.codehaus.janino.UnitCompiler.writeBranch(UnitCompiler.java:10261)
> *Steps to Reproduce:*
> I've come up with a unit test which I was able to run in 
> CodeGenerationSuite.scala:
> {code}
> test("split large if expressions into blocks due to JVM code size limit") {
>   val row = create_row("afafFAFFsqcategory2dadDADcategory8sasasadscategory24", 0)
>   val inputStr = 'a.string.at(0)
>   val inputIdx = 'a.int.at(1)
>   val length = 10
>   val valuesToCompareTo = for (i <- 1 to (length + 1)) yield ("category" + i)
>   val initCondition =
>     EqualTo(RegExpExtract(inputStr, Literal("category1"), inputIdx), valuesToCompareTo(0))
>   var res: Expression = If(initCondition, Literal("category1"), Literal("NULL"))
>   var cummulativeCondition: Expression = Not(initCondition)
>   for (index <- 1 to length) {
>     val valueExtractedFromInput =
>       RegExpExtract(inputStr, Literal("category" + (index + 1).toString), inputIdx)
>     val currComparee = If(cummulativeCondition, valueExtractedFromInput, Literal("NULL"))
>     val currCondition = EqualTo(currComparee, valuesToCompareTo(index))
>     val combinedCond = And(cummulativeCondition, currCondition)
>     res = If(combinedCond, If(combinedCond, valueExtractedFromInput, Literal("NULL")), res)
>     cummulativeCondition = And(Not(currCondition), cummulativeCondition)
>   }
>   val expressions = Seq(res)
>   val plan = GenerateUnsafeProjection.generate(expressions, true)
>   val actual = plan(row).toSeq(expressions.map(_.dataType))
>   val expected = Seq(UTF8String.fromString("category2"))
>   if (!checkResult(actual, expected)) {
>     fail(s"Incorrect Evaluation: expressions: $expressions, actual: $actual, expected: $expected")
>   }
> }
> {code}
> *Root Cause:*
> The current splitting of projection code doesn't (and can't) split the generated 
> code of an individual output column expression, so that code can grow beyond the 
> JVM's 64 KB method size limit.
> *Note:* This issue seems related to SPARK-14887, but I'm not sure whether the 
> root cause is the same.
>  
> *Proposed Fix:*
> The If expression should place the generated code for its predicate, true-value 
> and false-value expressions in separate methods registered on the codegen context, 
> and call those methods instead of inlining all of that code in its own generated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join due to calculated size of LogicalRdd

2016-10-25 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604810#comment-15604810
 ] 

Stavros Kontopoulos commented on SPARK-17959:
-

[~srowen] What do you think? I wasn't sure who I should ping this time.

> spark.sql.join.preferSortMergeJoin has no effect for simple join due to 
> calculated size of LogicalRdd
> -
>
> Key: SPARK-17959
> URL: https://issues.apache.org/jira/browse/SPARK-17959
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stavros Kontopoulos
>
> Example code:
> val df = spark.sparkContext.parallelize(List(
>     ("A", 10, "dss@s1"), ("A", 20, "dss@s2"),
>     ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb")))
>   .toDF("Group", "Amount", "Email")
> df.as("a").join(df.as("b"))
>   .where($"a.Group" === $"b.Group")
>   .explain()
> I always get the sort-merge strategy (never shuffle hash join) even if I set 
> spark.sql.join.preferSortMergeJoin to false, since
> sizeInBytes = 2^63-1
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
> and thus the condition here: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
> is always false.
> I don't think this should be the case: my df has a specific size and number of 
> partitions (200, which is btw far from optimal).
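
A minimal, self-contained sketch (default values assumed, condition simplified from the 
SparkStrategies line linked above) of why the shuffle-hash-join path can never win here: 
an RDD-backed logical plan reports sizeInBytes = 2^63-1 by default, so the size check 
always fails.

{code}
object PreferSortMergeJoinSketch {
  def main(args: Array[String]): Unit = {
    val sizeInBytes = BigInt(Long.MaxValue)            // default stats of a LogicalRDD
    val autoBroadcastJoinThreshold = 10L * 1024 * 1024 // spark.sql.autoBroadcastJoinThreshold default
    val numShufflePartitions = 200                     // spark.sql.shuffle.partitions default

    // Roughly the "can build a local hash map" check used when picking shuffle hash join.
    val canBuildLocalHashMap =
      sizeInBytes < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions

    // Prints false, so sort-merge join is chosen regardless of preferSortMergeJoin=false.
    println(canBuildLocalHashMap)
  }
}
{code}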



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604813#comment-15604813
 ] 

Sean Owen commented on SPARK-18085:
---

How does this relate to https://issues.apache.org/jira/browse/SPARK-6951? As 
long as this takes over as the single "history server is slow" umbrella, so there's 
just one, it seems OK.

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604437#comment-15604437
 ] 

Hyukjin Kwon edited comment on SPARK-18068 at 10/25/16 9:48 AM:


Oh, it is documented -only in codes- - 
https://github.com/apache/spark/blob/b3130c7b6a1ab4975023f08c3ab02ee8d2c7e995/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L287-L289




was (Author: hyukjin.kwon):
Oh, it is documented only in codes - 
https://github.com/apache/spark/blob/b3130c7b6a1ab4975023f08c3ab02ee8d2c7e995/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L287-L289



> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPart

[jira] [Updated] (SPARK-17959) spark.sql.join.preferSortMergeJoin has no effect for simple join due to calculated size of LogicalRdd

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17959:
--
Issue Type: Improvement  (was: Bug)

Not sure, it's not my area. It may not be a bug though, but at best an 
improvement, because it seems like this is either on purpose, or the planner simply 
isn't aware that a different join would be more optimal because of the default here.

> spark.sql.join.preferSortMergeJoin has no effect for simple join due to 
> calculated size of LogicalRdd
> -
>
> Key: SPARK-17959
> URL: https://issues.apache.org/jira/browse/SPARK-17959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stavros Kontopoulos
>
> Example code:
> val df = spark.sparkContext.parallelize(List(
>     ("A", 10, "dss@s1"), ("A", 20, "dss@s2"),
>     ("B", 1, "dss@qqa"), ("B", 2, "dss@qqb")))
>   .toDF("Group", "Amount", "Email")
> df.as("a").join(df.as("b"))
>   .where($"a.Group" === $"b.Group")
>   .explain()
> I always get the sort-merge strategy (never shuffle hash join) even if I set 
> spark.sql.join.preferSortMergeJoin to false, since
> sizeInBytes = 2^63-1
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala#L101
> and thus the condition here: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L127
> is always false.
> I don't think this should be the case: my df has a specific size and number of 
> partitions (200, which is btw far from optimal).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18068) Spark SQL doesn't parse some ISO 8601 formatted dates

2016-10-25 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604903#comment-15604903
 ] 

Stephane Maarek commented on SPARK-18068:
-

Thanks for the links guys! Really helpful
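
For reference, a minimal sketch in the spark-shell, assuming the JSON reader's 
timestampFormat option documented at the DataFrameReader link above is available in the 
Spark build being used; the pattern below is one way to accept ISO 8601 timestamps 
without seconds such as "2016-10-07T11:15Z":

{code}
import org.apache.spark.sql.types._

val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
val schema = StructType(StructField("tst", TimestampType, true) :: Nil)

val df = spark.read
  .schema(schema)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mmXXX") // assumed pattern, adjust as needed
  .json(jsonRDD)

df.show()
{code}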




> Spark SQL doesn't parse some ISO 8601 formatted dates
> -
>
> Key: SPARK-18068
> URL: https://issues.apache.org/jira/browse/SPARK-18068
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Stephane Maarek
>Priority: Minor
>
> The following fail, but shouldn't according to the ISO 8601 standard (seconds 
> can be omitted). Not sure where the issue lies (probably an external library?)
> {code}
> scala> sc.parallelize(Seq("2016-10-07T11:15Z"))
> res1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at 
> parallelize at :25
> scala> res1.toDF
> res2: org.apache.spark.sql.DataFrame = [value: string]
> scala> res2.select("value").show()
> +-----------------+
> |            value|
> +-----------------+
> |2016-10-07T11:15Z|
> +-----------------+
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
> scala> res2.select(col("value").cast(TimestampType)).show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> And the schema usage errors out right away:
> {code}
> scala> val jsonRDD = sc.parallelize(Seq("""{"tst":"2016-10-07T11:15Z"}"""))
> jsonRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at 
> parallelize at :33
> scala> val schema = StructType(StructField("tst",TimestampType,true)::Nil)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(tst,TimestampType,true))
> scala> val df = spark.read.schema(schema).json(jsonRDD)
> df: org.apache.spark.sql.DataFrame = [tst: timestamp]
> scala> df.show()
> 16/10/24 13:06:27 ERROR Executor: Exception in task 6.0 in stage 5.0 (TID 23)
> java.lang.IllegalArgumentException: 2016-10-07T11:15Z
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown 
> Source)
>   at 
> org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
>  Source)
>   at 
> javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
>   at 
> javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
>   at 
> javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertObject(JacksonParser.scala:215)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertField(JacksonParser.scala:182)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$.convertRootField(JacksonParser.scala:73)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:288)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$2.apply(JacksonParser.scala:285)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2366)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:285)
>   at 
> org.apache.spark.sql.execution.datasources.json.JacksonParser$$anonfun$parseJson$1.apply(JacksonParser.scala:280)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   

[jira] [Resolved] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17875.
---
  Resolution: Later
Target Version/s:   (was: 2.1.0)

Although I successfully removed the dependency (see PR), I could never get this 
to stop timing out in the Python 3 tests, though it worked locally. I'm stumped, so 
I'm just admitting defeat for now.

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Martha Solarte (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605175#comment-15605175
 ] 

Martha Solarte commented on SPARK-18009:


Hi,
I got the same error with Spark 2.0.0, but only if I enable 
spark.sql.thriftServer.incrementalCollect=true.
Do you have this parameter enabled?
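
A hedged sketch of how that setting could be toggled when starting the Thrift server 
programmatically (the usual route is passing 
--conf spark.sql.thriftServer.incrementalCollect=false to start-thriftserver.sh; the 
session below is assumed to exist already):

{code}
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// `spark` is an existing SparkSession built with Hive support.
spark.conf.set("spark.sql.thriftServer.incrementalCollect", "false")
HiveThriftServer2.startWithContext(spark.sqlContext)
{code}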


> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: sql, thrift
>
> After deploying the Spark Thrift Server on YARN, I tried to execute the following 
> command from beeline:
> > show databases;
> I got this error message: 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.servic

[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605211#comment-15605211
 ] 

Peng Meng commented on SPARK-18088:
---

Hi [~josephkb], I don't quite understand "Testing against only the p-value 
and not the test statistic does not really tell you anything." SelectFpr in 
sklearn is based only on the p-value. 

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16987) Add spark-default.conf property to define https port for spark history server

2016-10-25 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605225#comment-15605225
 ] 

chie hayashida commented on SPARK-16987:


Can I work on this issue?

> Add spark-default.conf property to define https port for spark history server
> -
>
> Key: SPARK-16987
> URL: https://issues.apache.org/jira/browse/SPARK-16987
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> With SPARK-2750, Spark History server UI becomes accessible on https port.
> Currently, https port is pre-defined to http port + 400. 
> Spark History server UI https port should not be pre-defined but it should be 
> configurable. 
> Thus, spark should to introduce new property to make spark history server 
> https port configurable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18092) add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"

2016-10-25 Thread albert fang (JIRA)
albert fang created SPARK-18092:
---

 Summary: add type cast to avoid error "Column prediction must be 
of type DoubleType but was actually FloatType"
 Key: SPARK-18092
 URL: https://issues.apache.org/jira/browse/SPARK-18092
 Project: Spark
  Issue Type: Bug
Reporter: albert fang
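
A hedged sketch of the workaround the title implies (data and column names are 
illustrative, not from this ticket): cast a FloatType prediction column to DoubleType 
before handing the DataFrame to an evaluator that requires DoubleType.

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Build a tiny DataFrame with a FloatType "prediction" column (spark-shell session assumed).
import spark.implicits._
val predictions = Seq((1.0f, 1.0), (0.0f, 1.0)).toDF("prediction", "label")

// Cast to DoubleType so evaluators that require DoubleType accept the column.
val fixed = predictions.withColumn("prediction", col("prediction").cast(DoubleType))
fixed.printSchema()
{code}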






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605231#comment-15605231
 ] 

Peng Meng commented on SPARK-18088:
---

In the previous implementation, testing against only the statistic was not right, 
so I submitted https://issues.apache.org/jira/browse/SPARK-17870 to fix that bug. 
Testing against only the p-value is OK: 3 of the 5 feature selection methods in 
sklearn are based only on the p-value; the other two are based on the statistic. 
Because the degrees of freedom are the same when computing the chi-square value 
there, sklearn can use the statistic. 
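
A minimal sketch (in the spark-shell, using the MLlib statistics API rather than 
ChiSqSelector itself; the tiny dataset is made up) of FPR-style selection driven only 
by p-values, as discussed above: keep every feature whose chi-squared p-value against 
the label is below alpha.

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))

val alpha = 0.05
// One ChiSqTestResult per feature; select the indices whose p-value passes the threshold.
val selectedIndices = Statistics.chiSqTest(data).zipWithIndex
  .filter { case (result, _) => result.pValue < alpha }
  .map { case (_, featureIndex) => featureIndex }
{code}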

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605239#comment-15605239
 ] 

Sean Owen commented on SPARK-18088:
---

[~josephkb] could we pause and discuss this? I'm not sure I agree with some of 
your assertions here. It might be useful to review the discussion on the 
previous changes. For example, it's actually comparing on the raw statistic that's 
incorrect in the context of Spark's implementation.

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18092) add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18092:


Assignee: Apache Spark

> add type cast to avoid error "Column prediction must be of type DoubleType 
> but was actually FloatType"
> --
>
> Key: SPARK-18092
> URL: https://issues.apache.org/jira/browse/SPARK-18092
> Project: Spark
>  Issue Type: Bug
>Reporter: albert fang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18092) add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605261#comment-15605261
 ] 

Apache Spark commented on SPARK-18092:
--

User 'd2Code' has created a pull request for this issue:
https://github.com/apache/spark/pull/15622

> add type cast to avoid error "Column prediction must be of type DoubleType 
> but was actually FloatType"
> --
>
> Key: SPARK-18092
> URL: https://issues.apache.org/jira/browse/SPARK-18092
> Project: Spark
>  Issue Type: Bug
>Reporter: albert fang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18092) add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18092:


Assignee: (was: Apache Spark)

> add type cast to avoid error "Column prediction must be of type DoubleType 
> but was actually FloatType"
> --
>
> Key: SPARK-18092
> URL: https://issues.apache.org/jira/browse/SPARK-18092
> Project: Spark
>  Issue Type: Bug
>Reporter: albert fang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Mark Grover (JIRA)
Mark Grover created SPARK-18093:
---

 Summary: Fix default value test in SQLConfSuite to work regardless 
of warehouse dir's existence
 Key: SPARK-18093
 URL: https://issues.apache.org/jira/browse/SPARK-18093
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2, 2.1.0
Reporter: Mark Grover


At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
fails because left side of the assert doesn't have a trailing slash while the 
right does.

As [~srowen] mentions 
[here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the JVM 
adds a trailing slash if the directory exists and doesn't if it doesn't. I 
think it'd be good for the test to work regardless of the directory's existence.
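
A minimal, self-contained sketch of the normalization idea (values and names below are 
illustrative, not the exact code in SQLConfSuite): strip any trailing slash from both 
sides before the assert, so the comparison doesn't depend on whether the warehouse 
directory already exists.

{code}
object WarehousePathAssertSketch {
  def normalize(path: String): String = path.stripSuffix("/")

  def main(args: Array[String]): Unit = {
    val fromJvm    = "file:/home/user/workspace/spark-warehouse/" // directory exists
    val fromConfig = "file:/home/user/workspace/spark-warehouse"  // as configured
    assert(normalize(fromJvm) == normalize(fromConfig))
  }
}
{code}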



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-18093:
--
Priority: Minor  (was: Major)

Hm, I thought it would definitely exist at that point in the test. It passes on 
my Mac and on Jenkins builds. Still there's no harm in making it a non-issue in 
this test, and matching regardless of trailing slash.

> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>Priority: Minor
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18093:


Assignee: (was: Apache Spark)

> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>Priority: Minor
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605355#comment-15605355
 ] 

Apache Spark commented on SPARK-18093:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/15623

> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>Priority: Minor
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18093:


Assignee: Apache Spark

> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>Priority: Minor
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605356#comment-15605356
 ] 

Mark Grover commented on SPARK-18093:
-

Yeah, I thought so too - but it failed in two different environments for me - 
an internal Jenkins job and my Mac. Perhaps it's related to some of the 
profiles/properties I am setting?

Anyway, I filed a PR. Thanks!

> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>Priority: Minor
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18093) Fix default value test in SQLConfSuite to work regardless of warehouse dir's existence

2016-10-25 Thread Mark Grover (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Grover updated SPARK-18093:

Description: 
At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} in 
SQLConfSuite fails because left side of the assert doesn't have a trailing 
slash while the right does.

As [~srowen] mentions 
[here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the JVM 
adds a trailing slash if the directory exists and doesn't if it doesn't. I 
think it'd be good for the test to work regardless of the directory's existence.

  was:
At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
fails because left side of the assert doesn't have a trailing slash while the 
right does.

As [~srowen] mentions 
[here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the JVM 
adds a trailing slash if the directory exists and doesn't if it doesn't. I 
think it'd be good for the test to work regardless of the directory's existence.


> Fix default value test in SQLConfSuite to work regardless of warehouse dir's 
> existence
> --
>
> Key: SPARK-18093
> URL: https://issues.apache.org/jira/browse/SPARK-18093
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mark Grover
>
> At least on my mac (with JDK 1.7.0_67), {{default value of WAREHOUSE_PATH}} 
> in SQLConfSuite fails because left side of the assert doesn't have a trailing 
> slash while the right does.
> As [~srowen] mentions 
> [here|https://github.com/apache/spark/pull/15382#discussion_r84240197], the 
> JVM adds a trailing slash if the directory exists and doesn't if it doesn't. 
> I think it'd be good for the test to work regardless of the directory's 
> existence.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15819) Add KMeansSummary in KMeans of PySpark

2016-10-25 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15819:

Shepherd: Yanbo Liang
Assignee: Jeff Zhang

> Add KMeansSummary in KMeans of PySpark
> -
>
> Key: SPARK-15819
> URL: https://issues.apache.org/jira/browse/SPARK-15819
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>
> There's no corresponding Python API for KMeansSummary; it would be nice to 
> have one. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18094) Move group analytics test cases from `SQLQuerySuite` into a query file test

2016-10-25 Thread Jiang Xingbo (JIRA)
Jiang Xingbo created SPARK-18094:


 Summary: Move group analytics test cases from `SQLQuerySuite` into 
a query file test
 Key: SPARK-18094
 URL: https://issues.apache.org/jira/browse/SPARK-18094
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Jiang Xingbo
Priority: Minor


Currently we have several test cases for group analytics (ROLLUP/CUBE/GROUPING 
SETS) in `SQLQuerySuite`; it would be better to move them into a query file test.

This is follow-up work for SPARK-18045.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18094) Move group analytics test cases from `SQLQuerySuite` into a query file test

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605655#comment-15605655
 ] 

Apache Spark commented on SPARK-18094:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15624

> Move group analytics test cases from `SQLQuerySuite` into a query file test
> ---
>
> Key: SPARK-18094
> URL: https://issues.apache.org/jira/browse/SPARK-18094
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Currently we have several test cases for group analytics (ROLLUP/CUBE/GROUPING 
> SETS) in `SQLQuerySuite`; it would be better to move them into a query file test.
> This is follow-up work for SPARK-18045.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18094) Move group analytics test cases from `SQLQuerySuite` into a query file test

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18094:


Assignee: (was: Apache Spark)

> Move group analytics test cases from `SQLQuerySuite` into a query file test
> ---
>
> Key: SPARK-18094
> URL: https://issues.apache.org/jira/browse/SPARK-18094
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Priority: Minor
>
> Currently we have several test cases for group analytics (ROLLUP/CUBE/GROUPING 
> SETS) in `SQLQuerySuite`; it would be better to move them into a query file test.
> This is follow-up work for SPARK-18045.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18094) Move group analytics test cases from `SQLQuerySuite` into a query file test

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18094:


Assignee: Apache Spark

> Move group analytics test cases from `SQLQuerySuite` into a query file test
> ---
>
> Key: SPARK-18094
> URL: https://issues.apache.org/jira/browse/SPARK-18094
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Apache Spark
>Priority: Minor
>
> Currently we have several test cases for group analytics (ROLLUP/CUBE/GROUPING 
> SETS) in `SQLQuerySuite`; it would be better to move them into a query file test.
> This is follow-up work for SPARK-18045.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18095) There is a display problem in spark UI storage tab when rdd was persisted in multiple replicas

2016-10-25 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-18095:
--

 Summary: There is a display problem in spark UI storage tab when 
rdd was persisted in multiple replicas
 Key: SPARK-18095
 URL: https://issues.apache.org/jira/browse/SPARK-18095
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Weichen Xu


There is a display problem in spark UI storage tab when rdd was persisted in 
multiple replicas.

e.g, if we use MEMORY_AND_DISK_2, it will show the persisting status as:

|| Block Name || Storage Level || Size in Memory|| Size on Disk|| Executors ||
|rdd_1_0|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop0:47393|
|rdd_1_0|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop0:47393|
|rdd_1_1|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop2:34284|
|rdd_1_1|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop2:34284|




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18096) Spark on Hive - 'Update' save mode

2016-10-25 Thread David Hodeffi (JIRA)
David Hodeffi created SPARK-18096:
-

 Summary: Spark on Hive - 'Update' save mode
 Key: SPARK-18096
 URL: https://issues.apache.org/jira/browse/SPARK-18096
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: David Hodeffi


When building ETL with Spark on Hive, the destination table needs to be updated 
incrementally. 
If it is a partitioned table, we don't need to update all of its partitions, just 
the ones that changed.

Right now the only way to update the destination from a DataFrame is 
SaveMode.Overwrite; the problem is that when loading incrementally you don't need 
to overwrite all partitions, only those that changed or were updated.
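
A hedged sketch of the partition-level workaround available today with Hive support 
(the table, partition and column names are illustrative): overwrite only the partitions 
that changed via INSERT OVERWRITE ... PARTITION instead of SaveMode.Overwrite on the 
whole table.

{code}
// `spark` is a SparkSession with Hive support; `incremental` holds the changed rows.
incremental.createOrReplaceTempView("staging")

spark.sql("""
  INSERT OVERWRITE TABLE target_table PARTITION (dt = '2016-10-25')
  SELECT col1, col2 FROM staging WHERE dt = '2016-10-25'
""")
{code}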



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18095) There is a display problem in spark UI storage tab when rdd was persisted in multiple replicas

2016-10-25 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18095:
---
Description: 
There is a display problem in spark UI storage tab when rdd was persisted in 
multiple replicas.

e.g, if we use MEMORY_AND_DISK_2, it will show the persisting status as:

|| Block Name || Storage Level || Size in Memory|| Size on Disk|| Executors ||
|rdd_1_0|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop0:47393|
|rdd_1_0|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop0:47393|
|rdd_1_1|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop2:34284|
|rdd_1_1|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop2:34284|

There are duplicated items in the displayed table, and showing both the 1x and 2x 
replicated levels in the Storage Level column is confusing to users.

  was:
There is a display problem in spark UI storage tab when rdd was persisted in 
multiple replicas.

e.g, if we use MEMORY_AND_DISK_2, it will show the persisting status as:

|| Block Name || Storage Level || Size in Memory|| Size on Disk|| Executors ||
|rdd_1_0|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop0:47393|
|rdd_1_0|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop0:47393|
|rdd_1_1|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop2:34284|
|rdd_1_1|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
hadoop2:34284|



> There is a display problem in spark UI storage tab when rdd was persisted in 
> multiple replicas
> --
>
> Key: SPARK-18095
> URL: https://issues.apache.org/jira/browse/SPARK-18095
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> There is a display problem in spark UI storage tab when rdd was persisted in 
> multiple replicas.
> e.g, if we use MEMORY_AND_DISK_2, it will show the persisting status as:
> || Block Name || Storage Level || Size in Memory|| Size on Disk|| Executors ||
> |rdd_1_0|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop0:47393|
> |rdd_1_0|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop0:47393|
> |rdd_1_1|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop2:34284|
> |rdd_1_1|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop2:34284|
> There are duplicated items in the displayed table, and showing both the 1x and 2x 
> replicated levels in the Storage Level column is confusing to users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18095) There is a display problem in spark UI storage tab when rdd was persisted in multiple replicas

2016-10-25 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605712#comment-15605712
 ] 

Weichen Xu commented on SPARK-18095:


I am working on it...

> There is a display problem in spark UI storage tab when rdd was persisted in 
> multiple replicas
> --
>
> Key: SPARK-18095
> URL: https://issues.apache.org/jira/browse/SPARK-18095
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> There is a display problem in spark UI storage tab when rdd was persisted in 
> multiple replicas.
> e.g, if we use MEMORY_AND_DISK_2, it will show the persisting status as:
> || Block Name || Storage Level || Size in Memory|| Size on Disk|| Executors ||
> |rdd_1_0|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop0:47393|
> |rdd_1_0|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop0:47393|
> |rdd_1_1|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop2:34284|
> |rdd_1_1|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 
> hadoop2:34284|
> There are duplicated items in the displayed table, and showing both the 1x and 2x 
> replicated levels in the Storage Level column is confusing to users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605719#comment-15605719
 ] 

Apache Spark commented on SPARK-17748:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15625

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.1.0
>
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elastic-net penalties by solving the equations locally with the OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can additionally add support for singular covariance matrices by adding 
> L-BFGS.
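For context, a hedged sketch of the user-facing configuration this would enable; the setters below are the existing public ml Params, not part of the patch itself:

{code}
import org.apache.spark.ml.regression.LinearRegression

// The case this issue targets: the "normal" (weighted least squares) solver
// combined with an L1/elastic-net penalty. Before this change, the normal
// solver only accepts pure L2 (elasticNetParam = 0), which has the
// closed-form Cholesky solution described above.
val lr = new LinearRegression()
  .setSolver("normal")        // solve the normal equations locally on the driver
  .setRegParam(0.1)
  .setElasticNetParam(0.5)    // L1/L2 mix; requires OWL-QN on the driver

// val model = lr.fit(trainingDF)  // trainingDF is a hypothetical DataFrame with "features"/"label"
{code}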



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt

2016-10-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18097:
--

 Summary: Can't drop a table from Hive if the schema is corrupt
 Key: SPARK-18097
 URL: https://issues.apache.org/jira/browse/SPARK-18097
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Davies Liu


When the schema of a Hive table is broken, we can't drop the table using Spark 
SQL, for example:
{code}
Error in SQL statement: QueryExecutionException: FAILED: 
IllegalArgumentException Error: > expected at the position 4443 of 
'struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_EDDIRECT:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_THIRDPARTY:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DOMINION:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,EBIZAUTOS:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,VAST_HOSTED:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CO:string:struct'
 but ':' is found.
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18095) There is a display problem in spark UI storage tab when rdd was persisted in multiple replicas

2016-10-25 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18095:
---
Comment: was deleted

(was: I am working on it...)

> There is a display problem in spark UI storage tab when rdd was persisted in 
> multiple replicas
> --
>
> Key: SPARK-18095
> URL: https://issues.apache.org/jira/browse/SPARK-18095
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> There is a display problem in the Spark UI storage tab when an RDD is persisted 
> with multiple replicas.
> e.g., if we use MEMORY_AND_DISK_2, it will show the persisting status as:
> || Block Name || Storage Level || Size in Memory|| Size on Disk|| Executors ||
> |rdd_1_0|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 hadoop0:47393|
> |rdd_1_0|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 hadoop0:47393|
> |rdd_1_1|Memory Deserialized 1x Replicated|176.0B|0.0B|hadoop2:48622 hadoop2:34284|
> |rdd_1_1|Memory Deserialized 2x Replicated|176.0B|0.0B|hadoop2:48622 hadoop2:34284|
> The table contains duplicated entries, and showing both 1x and 2x replication in 
> the storage level column for the same block is confusing to users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt

2016-10-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18097:
---
Description: 
When the schema of a Hive table is broken, we can't drop the table using Spark 
SQL, for example:
{code}
Error in SQL statement: QueryExecutionException: FAILED: 
IllegalArgumentException Error: > expected at the position 10 of 
'ss:string:struct<>' but ':' is found.
{code}

  was:
When the schema of Hive table is broken, we can't drop the table using Spark 
SQL, for example
{code}
Error in SQL statement: QueryExecutionException: FAILED: 
IllegalArgumentException Error: > expected at the position 4443 of 
'struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_EDDIRECT:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_THIRDPARTY:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DOMINION:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,EBIZAUTOS:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,VAST_HOSTED:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CO:string:struct'
 but ':' is found.
{code}


> Can't drop a table from Hive if the schema is corrupt
> -
>
> Key: SPARK-18097
> URL: https://issues.apache.org/jira/browse/SPARK-18097
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Davies Liu
>
> When the schema of a Hive table is broken, we can't drop the table using Spark 
> SQL, for example:
> {code}
> Error in SQL statement: QueryExecutionException: FAILED: 
> IllegalArgumentException Error: > expected at the position 10 of 
> 'ss:string:struct<>' but ':' is found.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt

2016-10-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18097:
---
Description: 
When the schema of a Hive table is broken, we can't drop the table using Spark 
SQL, for example:
{code}
Error in SQL statement: QueryExecutionException: FAILED: 
IllegalArgumentException Error: > expected at the position 10 of 
'ss:string:struct<>' but ':' is found.
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:336)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:480)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754)
at 
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:104)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at 
org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:353)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:351)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:351)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:228)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:227)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:255)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:126)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:267)
at 
org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:753)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.(Dataset.scala:186)
at org.apache.spark.sql.Dataset.(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.

[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605815#comment-15605815
 ] 

Marcelo Vanzin commented on SPARK-18085:


It's mildly related. The changes here don't do anything to help with the first 
SHS startup, which will still be slow. Subsequent startups would be fast, 
though, since the data would already be available locally.

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6951) History server slow startup if the event log directory is large

2016-10-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-6951:
---

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2016-10-25 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605820#comment-15605820
 ] 

Marcelo Vanzin commented on SPARK-6951:
---

I reopened this after discussion in the bug; the other change (SPARK-18010) 
makes startup a little faster, but not necessarily fast, for large directories 
/ log files.

> History server slow startup if the event log directory is large
> ---
>
> Key: SPARK-6951
> URL: https://issues.apache.org/jira/browse/SPARK-6951
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
>
> I started my history server, then navigated to the web UI where I expected to 
> be able to view some completed applications, but the webpage was not 
> available. It turned out that the History Server was not finished parsing all 
> of the event logs in the event log directory that I had specified. I had 
> accumulated a lot of event logs from months of running Spark, so it would 
> have taken a very long time for the History Server to crunch through them 
> all. I purged the event log directory and started from scratch, and the UI 
> loaded immediately.
> We should have a pagination strategy or parse the directory lazily to avoid 
> needing to wait after starting the history server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18098) Broadcast creates 1 instance / core, not 1 instance / executor

2016-10-25 Thread Anthony Sciola (JIRA)
Anthony Sciola created SPARK-18098:
--

 Summary: Broadcast creates 1 instance / core, not 1 instance / 
executor
 Key: SPARK-18098
 URL: https://issues.apache.org/jira/browse/SPARK-18098
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
Reporter: Anthony Sciola


I've created my spark executors with $SPARK_HOME/sbin/start-slave.sh -c 7 -m 55g

When I run a job which broadcasts data, it appears each *thread* requests and 
receives a copy of the broadcast object, not each *executor*. This means I need 
7x as much memory for the broadcasted item because I have 7 cores.

The problem appears to be due to a lack of synchronization around requesting 
broadcast items.

The only workaround I've come up with is writing the data out to HDFS, 
broadcasting the paths, and doing a synchronized load from HDFS.
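For reference, a minimal sketch of the usage pattern in question (the data is illustrative, not the reporter's job); each task reads the shared value through Broadcast.value, which is intended to be materialized once per executor JVM rather than once per task:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical large read-only lookup table.
val lookup = (1 to 1000000).map(i => i -> i.toString).toMap
val bc = sc.broadcast(lookup)

// Tasks on the same executor are expected to share one copy via bc.value
// (which is backed by a lazy val, per the comment below).
val counts = sc.parallelize(1 to 10000)
  .map(i => bc.value.getOrElse(i, "missing"))
  .count()
{code}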



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17829:


Assignee: Apache Spark  (was: Tyson Condie)

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.
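A rough sketch of the proposed direction, not the actual patch; the offset shape and the json4s calls are assumptions for illustration (json4s ships with Spark):

{code}
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization

// Hypothetical offset record; real sources would define their own offset types.
case class SourceOffset(source: String, offset: Long)

implicit val formats = DefaultFormats

val offsets = List(SourceOffset("source-0", 42L), SourceOffset("source-1", 17L))

// Human-readable and independent of Java serialization, so it can stay stable
// across Spark releases.
val json = Serialization.write(offsets)
val restored = Serialization.read[List[SourceOffset]](json)
{code}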



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605879#comment-15605879
 ] 

Apache Spark commented on SPARK-17829:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/15626

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17829) Stable format for offset log

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17829:


Assignee: Tyson Condie  (was: Apache Spark)

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18098) Broadcast creates 1 instance / core, not 1 instance / executor

2016-10-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605898#comment-15605898
 ] 

Sean Owen commented on SPARK-18098:
---

It shouldn't work that way. The value is loaded in a lazy val, at least. I 
think I can imagine cases where you would end up with several per executor but 
they're not the normal use cases. Can you say more about what you're executing 
or what you're seeing?

> Broadcast creates 1 instance / core, not 1 instance / executor
> --
>
> Key: SPARK-18098
> URL: https://issues.apache.org/jira/browse/SPARK-18098
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Anthony Sciola
>
> I've created my spark executors with $SPARK_HOME/sbin/start-slave.sh -c 7 -m 
> 55g
> When I run a job which broadcasts data, it appears each *thread* requests and 
> receives a copy of the broadcast object, not each *executor*. This means I 
> need 7x as much memory for the broadcasted item because I have 7 cores.
> The problem appears to be due to a lack of synchronization around requesting 
> broadcast items.
> The only workaround I've come up with is writing the data out to HDFS, 
> broadcasting the paths, and doing a synchronized load from HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18010) Remove unneeded heavy work performed by FsHistoryProvider for building up the application listing UI page

2016-10-25 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18010.

   Resolution: Fixed
 Assignee: Vinayak Joshi
Fix Version/s: 2.1.0

> Remove unneeded heavy work performed by FsHistoryProvider for building up the 
> application listing UI page
> -
>
> Key: SPARK-18010
> URL: https://issues.apache.org/jira/browse/SPARK-18010
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.6.2, 2.0.1, 2.1.0
>Reporter: Vinayak Joshi
>Assignee: Vinayak Joshi
> Fix For: 2.1.0
>
>
> There are known complaints/cribs about History Server's Application List not 
> updating quickly enough when the event log files that need replay are huge. 
> Currently, the FsHistoryProvider design causes the entire event log file to 
> be replayed when building the initial application listing (refer to the method 
> mergeApplicationListing(fileStatus: FileStatus)). The process of replay 
> involves:
>  - each line in the event log being read as a string,
>  - parsing the string to a Json structure
>  - converting the Json to the corresponding Scala classes with nested 
> structures
> Particularly the part involving parsing string to Json and then to Scala 
> classes is expensive. Tests show that majority of time spent in replay is in 
> doing this work. 
> When the replay is performed for building the application listing, the only 
> two events that the code really cares for are "SparkListenerApplicationStart" 
> and "SparkListenerApplicationEnd" - since the only listener attached to the 
> ReplayListenerBus at that point is the ApplicationEventListener. This means 
> that when processing an event log file with a huge number (hundreds of 
> thousands, can be more) of events, the work done to deserialize all of these 
> event,  and then replay them is not needed. Only two events are what we're 
> interested in, and this can be used to ensure that when replay is performed 
> for the purpose of building the application list, we only make the effort to 
> replay these two events and not others. 
> My tests show that this drastically improves application list load time. For 
> a 150MB event log from a user, with over 100,000 events, the load time (local 
> on my mac) comes down from about 16 secs to under 1 second using this 
> approach. For customers that typically execute applications with large event 
> logs, and thus have multiple large event logs present, this can speed up how 
> soon the history server UI lists the apps considerably.
> I will be updating a pull request with a take at fixing this.
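A rough illustration of the optimization described above, not the actual FsHistoryProvider change (the log path is a placeholder):

{code}
import scala.io.Source

// Only these two events matter when building the application listing, so a
// cheap string check can skip the expensive JSON parsing for every other line.
val wantedEvents = Seq("SparkListenerApplicationStart", "SparkListenerApplicationEnd")

val relevantLines = Source.fromFile("/tmp/eventlog")  // hypothetical local event log file
  .getLines()
  .filter(line => wantedEvents.exists(e => line.contains(e)))
  .toList

// Only relevantLines would then go through the JSON -> Scala class deserialization.
{code}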



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2016-10-25 Thread Matthew Porter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Porter updated SPARK-16183:
---
Affects Version/s: 2.0.0

> Large Spark SQL commands cause StackOverflowError in parser when using 
> sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Running on AWS EMR
>Reporter: Matthew Porter
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL 
> command to be run via sqlContext.sql(cmd) based on a large number of 
> parameters. As the number of input files to be filtered and joined in this 
> query grows, so does the length of the SQL query. The tool runs fine up until 
> about 200+ files are included in the join, at which point the SQL command 
> becomes very long (~100K characters). It is only on these longer queries that 
> Spark fails, throwing an exception due to what seems to be too much recursion 
> occurring within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
> merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
> 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> sca

[jira] [Created] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Kishor Patil (JIRA)
Kishor Patil created SPARK-18099:


 Summary: Spark distributed cache should throw exception if same 
file is specified to dropped in --files --archives
 Key: SPARK-18099
 URL: https://issues.apache.org/jira/browse/SPARK-18099
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.1, 2.0.0
Reporter: Kishor Patil


Following the changes for [SPARK-14423] (handle jar conflict issues when 
uploading to the distributed cache), yarn#client by default uploads all --files 
and --archives in the assembly to the HDFS staging folder. It should throw an 
exception if the same file appears in both --files and --archives, since it is 
otherwise ambiguous whether to uncompress the file or leave it compressed.
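A minimal sketch of the kind of check being proposed, using a hypothetical helper rather than the actual yarn#client code (matching on base file names is an assumption for illustration):

{code}
// Fail fast when the same file name is passed through both --files and --archives.
def checkNoOverlap(files: Seq[String], archives: Seq[String]): Unit = {
  val fileNames = files.map(_.split("/").last).toSet
  val archiveNames = archives.map(_.split("/").last).toSet
  val dup = fileNames.intersect(archiveNames)
  if (dup.nonEmpty) {
    throw new IllegalArgumentException(
      s"Resources appear in both --files and --archives: ${dup.mkString(", ")}")
  }
}

checkNoOverlap(Seq("hdfs:///tmp/app.conf"), Seq("hdfs:///tmp/libs.zip"))  // ok, no exception
{code}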



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606068#comment-15606068
 ] 

Apache Spark commented on SPARK-18099:
--

User 'kishorvpatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/15627

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>
> Following the changes for [SPARK-14423] (handle jar conflict issues when 
> uploading to the distributed cache), yarn#client by default uploads all --files 
> and --archives in the assembly to the HDFS staging folder. It should throw an 
> exception if the same file appears in both --files and --archives, since it is 
> otherwise ambiguous whether to uncompress the file or leave it compressed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18099:


Assignee: (was: Apache Spark)

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>
> Following the changes for [SPARK-14423] (handle jar conflict issues when 
> uploading to the distributed cache), yarn#client by default uploads all --files 
> and --archives in the assembly to the HDFS staging folder. It should throw an 
> exception if the same file appears in both --files and --archives, since it is 
> otherwise ambiguous whether to uncompress the file or leave it compressed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18099:


Assignee: Apache Spark

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>Assignee: Apache Spark
>
> Following the changes for [SPARK-14423] (handle jar conflict issues when 
> uploading to the distributed cache), yarn#client by default uploads all --files 
> and --archives in the assembly to the HDFS staging folder. It should throw an 
> exception if the same file appears in both --files and --archives, since it is 
> otherwise ambiguous whether to uncompress the file or leave it compressed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606142#comment-15606142
 ] 

Joseph K. Bradley commented on SPARK-18088:
---

Ahh, you're right, sorry, I see that now that I'm looking at master.  I'll link 
the follow-up JIRA to the original JIRA.

And I agree my assertion about p-value wasn't correct.  Will fix.  Thanks!

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
> One major item: FPR is not implemented correctly.  Testing against only the 
> p-value and not the test statistic does not really tell you anything.  We 
> should follow sklearn, which allows a p-value threshold for any selection 
> method: 
> [http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
> * In this PR, I'm just going to remove FPR completely.  We can add it back in 
> a follow-up PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Description: 
There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups

  was:
There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups

One major item: FPR is not implemented correctly.  Testing against only the 
p-value and not the test statistic does not really tell you anything.  We 
should follow sklearn, which allows a p-value threshold for any selection 
method: 
[http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html]
* In this PR, I'm just going to remove FPR completely.  We can add it back in a 
follow-up PR.


> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Issue Type: Improvement  (was: Bug)

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Priority: Minor  (was: Major)

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606146#comment-15606146
 ] 

Joseph K. Bradley commented on SPARK-18088:
---

How do you feel about renaming the selectorType values to match the parameters? 
 I'd like to call them "numTopFeatures", "percentile" and "fpr".
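For reference, a hedged sketch of the existing ml API Params that the selectorType string values would be aligned with (the column names are illustrative):

{code}
import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setNumTopFeatures(50)            // the Param name the corresponding selectorType value would match
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

// val model = selector.fit(labeledDF)  // labeledDF is a hypothetical DataFrame
{code}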

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17692) Document ML/MLlib behavior changes in Spark 2.1

2016-10-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606178#comment-15606178
 ] 

Joseph K. Bradley commented on SPARK-17692:
---

[SPARK-17870] changes the output of ChiSqSelector.  It is a bug fix, so it is 
an acceptable change of behavior.

> Document ML/MLlib behavior changes in Spark 2.1
> ---
>
> Key: SPARK-17692
> URL: https://issues.apache.org/jira/browse/SPARK-17692
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>  Labels: 2.1.0
>
> This JIRA records behavior changes of ML/MLlib between 2.0 and 2.1, so we can 
> note those changes (if any) in the user guide's Migration Guide section. If 
> you found one, please comment below and link the corresponding JIRA here.
> * SPARK-17389: Reduce KMeans default k-means|| init steps to 2 from 5.  
> * SPARK-17870: ChiSquareSelector use pValue rather than raw statistic for 
> SelectKBest features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18070) binary operator should not consider nullability when comparing input types

2016-10-25 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18070.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Issue resolved by pull request 15606
[https://github.com/apache/spark/pull/15606]

> binary operator should not consider nullability when comparing input types
> --
>
> Key: SPARK-18070
> URL: https://issues.apache.org/jira/browse/SPARK-18070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Comment: was deleted

(was: Calling this a bug since FPR is not implemented correctly.)

> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18100) Improve the performance of get_json_object using Gson

2016-10-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18100:
--

 Summary: Improve the performance of get_json_object using Gson
 Key: SPARK-18100
 URL: https://issues.apache.org/jira/browse/SPARK-18100
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


Based on the benchmark here: 
http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/,
which suggests Gson can be much faster than Jackson, Gson could perhaps be used to 
improve the performance of get_json_object.
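For context, a small usage sketch of the expression in question (the sample JSON and column names are illustrative); the path is currently evaluated with Jackson, and the proposal is to benchmark Gson as a replacement:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().appName("get-json-object-sketch").getOrCreate()
import spark.implicits._

val df = Seq("""{"user": {"name": "alice", "age": 30}}""").toDF("json")

// Extract a nested field with a JSONPath-style expression.
df.select(get_json_object($"json", "$.user.name").alias("name")).show()
{code}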



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18100) Improve the performance of get_json_object using Gson

2016-10-25 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-18100:
---
Issue Type: Improvement  (was: Bug)

> Improve the performance of get_json_object using Gson
> -
>
> Key: SPARK-18100
> URL: https://issues.apache.org/jira/browse/SPARK-18100
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> Based on the benchmark here: 
> http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/,
> which suggests Gson can be much faster than Jackson, Gson could perhaps be used to 
> improve the performance of get_json_object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18019) Log instrumentation in GBTs

2016-10-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18019.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15574
[https://github.com/apache/spark/pull/15574]

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.1.0
>
>
> Sub-task for adding instrumentation to GBTs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-10-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606347#comment-15606347
 ] 

Apache Spark commented on SPARK-17471:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/15628

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).
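For context, a hedged sketch of the existing Vector behavior that this issue proposes to mirror on Matrix (values are illustrative):

{code}
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Vector.compressed already picks whichever of the dense/sparse representations
// needs less storage.
val mostlyZero = Vectors.dense(0.0, 0.0, 0.0, 0.0, 7.0)
val compact = mostlyZero.compressed   // a SparseVector here, since it is smaller

// Matrix has no equivalent at the time of this issue; the request is a compressed
// method that can also choose (or default) row-major vs. column-major storage.
val m = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
{code}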



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17471) Add compressed method for Matrix class

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17471:


Assignee: (was: Apache Spark)

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17471) Add compressed method for Matrix class

2016-10-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17471:


Assignee: Apache Spark

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Vectors in Spark have a {{compressed}} method which selects either sparse or 
> dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified should select the lower storage 
> representation (for sparse).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-10-25 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18101:
--

 Summary: ExternalCatalogSuite should test with mixed case fields
 Key: SPARK-18101
 URL: https://issues.apache.org/jira/browse/SPARK-18101
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Eric Liang


Currently, it uses field names such as "a" and "b" which are not useful for 
testing case preservation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-10-25 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18101:
---
Issue Type: Sub-task  (was: Test)
Parent: SPARK-17861

> ExternalCatalogSuite should test with mixed case fields
> ---
>
> Key: SPARK-18101
> URL: https://issues.apache.org/jira/browse/SPARK-18101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> Currently, it uses field names such as "a" and "b" which are not useful for 
> testing case preservation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18102) Failed to deserialize the result of task

2016-10-25 Thread Davies Liu (JIRA)
Davies Liu created SPARK-18102:
--

 Summary: Failed to deserialize the result of task
 Key: SPARK-18102
 URL: https://issues.apache.org/jira/browse/SPARK-18102
 Project: Spark
  Issue Type: Bug
Reporter: Davies Liu


{code}
16/10/25 15:17:04 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() for one-way message.
java.lang.ClassNotFoundException: org.apache.spark.util.SerializableBuffer not 
found in 
com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@3d98d138
at 
com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:578)
at 
org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570)
at 
org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel

[jira] [Created] (SPARK-18103) Rename *FileCatalog to *FileProvider

2016-10-25 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18103:
--

 Summary: Rename *FileCatalog to *FileProvider
 Key: SPARK-18103
 URL: https://issues.apache.org/jira/browse/SPARK-18103
 Project: Spark
  Issue Type: Improvement
Reporter: Eric Liang
Priority: Minor


In the SQL component there are too many different classes whose names are some 
variant of *Catalog, which is quite confusing. We should rename the subclasses 
of FileCatalog to avoid this confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-25 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606484#comment-15606484
 ] 

Nicholas Chammas commented on SPARK-18084:
--

cc [~marmbrus] - Dunno if this is actually a bug or just an unsupported or 
inappropriate use case.
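
A possible workaround, sketched here as an illustration rather than taken from 
the ticket: partitionBy() appears to resolve only top-level column names (see the 
AnalysisException in the quoted traceback below), so promoting the nested field 
to its own column first avoids the error. The column name 'b', the parquet 
format, and the output path are assumptions.

{code}
# Copy the nested field a.b into a top-level column before partitioning;
# partitionBy() then sees a real column named 'b'.
from pyspark.sql import Row, functions as F

df = spark.createDataFrame([Row(a=Row(b=5))])
flat = df.withColumn('b', F.col('a.b'))
# parquet keeps the struct column 'a'; the output path is illustrative.
flat.write.partitionBy('b').parquet('/tmp/test_flat')
{code}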

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/usr/local/Cellar/apache-spark/2.0

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Environment: Pyspark 2.0.0, Ipython 4.2  (was: Pyspark 2.0.1, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.0, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user-defined function to columnA, and then filter on 
> columnB. However, the two filters were being merged into a single Filter below 
> the withColumn projection in the optimized plan, which caused errors because 
> the UDF received input that the first filter should already have removed.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
> ^s)
>+- LogicalRDD [input#0L]
> {code}
> Executing test_df.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
> Executing test_df.show() in pyspark 1.6.2 yields:
> {code}
> +-+--+
> |input|output|
> +-+--+
> |2|second|
> +-+--+
> {code}
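
A hedged workaround sketch, not proposed in the ticket: since the optimizer may 
combine both filters and evaluate the UDF on rows the first filter was meant to 
drop, one defensive option is to make the UDF tolerant of such input, e.g. using 
dict.get() so unexpected keys map to null instead of raising KeyError.

{code}
# Defensive variant of the UDF from the repro above: unknown keys such as 0
# return None instead of raising KeyError, so the merged filter can safely
# evaluate the UDF before the 'input > 0' predicate.
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial

data = [{'input': 0}, {'input': 1}, {'input': 2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1: 'first', 2: 'second'}

def apply_dict(input_dict, value):
    return input_dict.get(value)   # None for missing keys

test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = (input_df.filter('input > 0')
           .withColumn('output', test_udf('input'))
           .filter(F.col('output').rlike('^s')))
test_df.show()
{code}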



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-25 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606523#comment-15606523
 ] 

Alex Bozarth commented on SPARK-18085:
--

I am *very* interested in working with you on this project and (post-Spark 
Summit) would love to discuss some of the UI ideas my team has been tossing 
around (a few covered in your non-goals).

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to solving them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15482) ClassCast exception when join two tables.

2016-10-25 Thread roberto sancho rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606601#comment-15606601
 ] 

roberto sancho rojas commented on SPARK-15482:
--

I have the same problem when I execute this code from Spark 1.6 with HDP 
2.4.0.0-169 and PHOENIX 2.4.0:
df = sqlContext.read \
  .format("org.apache.phoenix.spark") \
  .option("table", "TABLA") \
  .option("zkUrl", "XXX:/hbase-unsecure") \
  .load()
df.show()
Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
org.apache.spark.sql.Row
at 
org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)

Here is my classpath:
/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-spark-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-client.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-common.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/phoenix-core-4.4.0.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/phoenix/lib/hbase-protocol.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar
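
For reference, a minimal PySpark sketch of the load-register-join flow described 
in the quoted report below; the table names, the zkUrl value, and the join key 
'ID' are placeholders rather than values from the ticket.

{code}
# Load two Phoenix tables as DataFrames, register them as temp tables, and
# join them with Spark SQL -- the join is where the reported
# ClassCastException surfaces.
adf = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "A") \
    .option("zkUrl", "zkhost:2181:/hbase-unsecure") \
    .load()
adf.registerTempTable("ATEMPTABLE")

bdf = sqlContext.read \
    .format("org.apache.phoenix.spark") \
    .option("table", "B") \
    .option("zkUrl", "zkhost:2181:/hbase-unsecure") \
    .load()
bdf.registerTempTable("BTEMPTABLE")

# 'ID' is a hypothetical join key used only for illustration.
sqlContext.sql(
    "select count(*) from ATEMPTABLE join BTEMPTABLE on ATEMPTABLE.ID = BTEMPTABLE.ID"
).show()
{code}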

> ClassCast exception when join two tables.
> -
>
> Key: SPARK-15482
> URL: https://issues.apache.org/jira/browse/SPARK-15482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Phoenix: 1.2
> Spark: 1.5.0-cdh5.5.1
>Reporter: jingtao
>
> I have two tables A and B in Phoenix.
> I load table 'A' as dataFrame 'ADF' using Spark, and register dataFrame 
> 'ADF' as temp table 'ATEMPTABLE'.
> B is handled the same way as A.
> A --> ADF ---> ATEMPTABLE
> B --> BDF ---> BTEMPTABLE
> Then, I join the two temp tables 'ATEMPTABLE' and 'BTEMPTABLE' using Spark 
> SQL, for example 'select count(*) from ATEMPTABLE join BTEMPTABLE on ...'
> It errors with the following message: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 6, hadoop05): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow cannot be cast to 
> org.apache.spark.sql.Row
> at 
> org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
> at scala.Option.foreach(Option.scala:236)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
> at org.apache.spark.util.EventLoop$$anon$1.run(E

[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Assignee: chie hayashida

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is 
> redirected to https://host:port+400. 
> So, the spark history server log should be updated to print the https url 
> instead of the http url. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-16988.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is 
> redirected to https://host:port+400. 
> So, the spark history server log should be updated to print the https url 
> instead of the http url. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Component/s: Spark Core

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is 
> redirected to https://host:port+400. 
> So, the spark history server log should be updated to print the https url 
> instead of the http url. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Component/s: (was: Spark Shell)

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is 
> redirected to https://host:port+400. 
> So, the spark history server log should be updated to print the https url 
> instead of the http url. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-25 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-16988:
---
Component/s: (was: Spark Core)
 Web UI

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: chie hayashida
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> When spark ssl is enabled, the spark history server ui (http://host:port) is 
> redirected to https://host:port+400. 
> So, the spark history server log should be updated to print the https url 
> instead of the http url. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18009:

Target Version/s: 2.0.1, 2.1.0

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift Server on YARN, I tried to execute the 
> following command from beeline:
> > show databases;
> I got this error message: 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210)
>   ... 15 more
> Error: E

[jira] [Updated] (SPARK-18009) Spark 2.0.1 SQL Thrift Error

2016-10-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18009:

Labels: thrift  (was: sql thrift)

> Spark 2.0.1 SQL Thrift Error
> 
>
> Key: SPARK-18009
> URL: https://issues.apache.org/jira/browse/SPARK-18009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: apache hadoop 2.6.2 
> spark 2.0.1
>Reporter: Jerryjung
>Priority: Critical
>  Labels: thrift
>
> After deploying the Spark Thrift Server on YARN, I tried to execute the 
> following command from beeline:
> > show databases;
> I got this error message: 
> {quote}
> beeline> !connect jdbc:hive2://localhost:1 a a
> Connecting to jdbc:hive2://localhost:1
> 16/10/19 22:50:18 INFO Utils: Supplied authorities: localhost:1
> 16/10/19 22:50:18 INFO Utils: Resolved authority: localhost:1
> 16/10/19 22:50:18 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:1
> Connected to: Spark SQL (version 2.0.1)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:1> show databases;
> java.lang.IllegalStateException: Can't overwrite cause with 
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at java.lang.Throwable.initCause(Throwable.java:456)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:236)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toCause(HiveSQLException.java:197)
>   at 
> org.apache.hive.service.cli.HiveSQLException.<init>(HiveSQLException.java:108)
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
>   at 
> org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
>   at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:42)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1794)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:860)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:713)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 669.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 669.0 (TID 3519, edw-014-22): java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow cannot be cast 
> to org.apache.spark.sql.catalyst.expressions.UnsafeRow
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.hive.service.cli.HiveSQLException.newInstance(HiveSQLException.java:244)
>   at 
> org.apache.hive.service.cli.HiveSQLException.toStackTrace(HiveSQLException.java:210)
>   ... 15 more
> Error
