[jira] [Commented] (SPARK-20067) Use treeString to print out the table schema for CatalogTable

2017-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937764#comment-15937764
 ] 

Apache Spark commented on SPARK-20067:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17394

> Use treeString to print out the table schema for CatalogTable
> -
>
> Key: SPARK-20067
> URL: https://issues.apache.org/jira/browse/SPARK-20067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we are using {{sql}} to print the schema. To make the schema more 
> readable, we should use {{treeString}}, like what we do in the Dataset API's 
> {{printSchema}}.
> Below is the current way:
> {noformat}
> Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), 
> `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
> {noformat}
> After the change, it should look like
> {noformat}
> Schema: root
>  |-- a: string (nullable = true)
>  |-- b: integer (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> {noformat}
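
For illustration only, a minimal sketch of the two formats on a plain StructType (the field names are taken from the example above; CatalogTable internals are not shown):

{code}
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("a", StringType),
  StructField("b", IntegerType),
  StructField("c", StringType),
  StructField("d", StringType)))

println(schema.sql)         // compact form, e.g. STRUCT<`a`: STRING, `b`: INT, ...>
println(schema.treeString)  // tree form: root / |-- a: string (nullable = true) / ...
{code}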






[jira] [Assigned] (SPARK-20067) Use treeString to print out the table schema for CatalogTable

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20067:


Assignee: Xiao Li  (was: Apache Spark)

> Use treeString to print out the table schema for CatalogTable
> -
>
> Key: SPARK-20067
> URL: https://issues.apache.org/jira/browse/SPARK-20067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Currently, we are using {{sql}} to print the schema. To make the schema more 
> readable, we should use {{treeString}}, like what we do in the Dataset API's 
> {{printSchema}}.
> Below is the current way:
> {noformat}
> Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), 
> `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
> {noformat}
> After the change, it should look like
> {noformat}
> Schema: root
>  |-- a: string (nullable = true)
>  |-- b: integer (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> {noformat}






[jira] [Assigned] (SPARK-20067) Use treeString to print out the table schema for CatalogTable

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20067:


Assignee: Apache Spark  (was: Xiao Li)

> Use treeString to print out the table schema for CatalogTable
> -
>
> Key: SPARK-20067
> URL: https://issues.apache.org/jira/browse/SPARK-20067
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, we are using {{sql}} to print the schema. To make the schema more 
> readable, we should use {{treeString}}, like what we do in the Dataset API's 
> {{printSchema}}.
> Below is the current way:
> {noformat}
> Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), 
> `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
> {noformat}
> After the change, it should look like
> {noformat}
> Schema: root
>  |-- a: string (nullable = true)
>  |-- b: integer (nullable = true)
>  |-- c: string (nullable = true)
>  |-- d: string (nullable = true)
> {noformat}






[jira] [Created] (SPARK-20067) Use treeString to print out the table schema for CatalogTable

2017-03-22 Thread Xiao Li (JIRA)
Xiao Li created SPARK-20067:
---

 Summary: Use treeString to print out the table schema for 
CatalogTable
 Key: SPARK-20067
 URL: https://issues.apache.org/jira/browse/SPARK-20067
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li


Currently, we are using {{sql}} to print the schema. To make the schema more 
readable, we should use {{treeString}}, like what we do in the Dataset API's 
{{printSchema}}.

Below is the current way:
{noformat}
Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: 
STRING (nullable = true), `d`: STRING (nullable = true)>
{noformat}

After the change, it should look like
{noformat}
Schema: root
 |-- a: string (nullable = true)
 |-- b: integer (nullable = true)
 |-- c: string (nullable = true)
 |-- d: string (nullable = true)
{noformat}







[jira] [Resolved] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19913.
---
Resolution: Won't Fix

> Log warning rather than throw AnalysisException when output is partitioned 
> although format is memory, console or foreach
> 
>
> Key: SPARK-19913
> URL: https://issues.apache.org/jira/browse/SPARK-19913
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When batches are executed with the memory, console or foreach format, 
> `assertNotPartitioned` checks whether the output is partitioned and throws 
> AnalysisException in case it is.
> But I wonder whether it's better to log a warning rather than throw the exception, 
> because partitioning does not affect the output for those formats and also does 
> not bring any negative impact.
> Also, this assertion is not applied when the format is `console`. I think we 
> should apply the assertion in this case too.
> By fixing these, we can easily switch the format to memory or console for 
> debugging purposes.
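
For reference, a minimal sketch of the call pattern being discussed (the source and column below are made up); with the memory or foreach sink, the partitionBy call currently makes start() fail via assertNotPartitioned:

{code}
import org.apache.spark.sql.functions.current_date

val events = spark.readStream.format("rate").load()
  .withColumn("date", current_date())

// Currently throws AnalysisException ("... does not support partitioning");
// the proposal is to log a warning instead.
events.writeStream
  .format("memory")
  .queryName("debug_table")
  .partitionBy("date")
  .outputMode("append")
  .start()
{code}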






[jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions

2017-03-22 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937754#comment-15937754
 ] 

Liang-Chi Hsieh commented on SPARK-14083:
-

[~kiszk] Thanks for rebasing it; that makes it much easier to continue the work. I 
will look into it to see where it can be improved, and I will publish a new branch 
if I make any progress.

> Analyze JVM bytecode and turn closures into Catalyst expressions
> 
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized executions.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), 
> lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note that there is a big emphasis on "small" and "easy to reason about". A patch 
> should be rejected if it is too complicated or difficult to reason about.
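
As a rough hand-written illustration of the rewrite the ticket wants to automate (the Person class here is made up), the closure version and the expression version compute the same thing, but only the latter is transparent to Catalyst:

{code}
import spark.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("a", 25), Person("b", 15)).toDS()

// Closure-based: opaque to the optimizer, evaluated as a lambda per row.
val adults1 = ds.filter(_.age > 18)

// Expression-based equivalent that bytecode analysis could emit automatically.
val adults2 = ds.filter($"age" > 18)
{code}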






[jira] [Commented] (SPARK-16060) Vectorized Orc reader

2017-03-22 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937711#comment-15937711
 ] 

Liang-Chi Hsieh commented on SPARK-16060:
-

cc [~rxin] If the approach based on the Hive package is not acceptable to you, shall 
we instead work directly on making Presto's ORC reader work for Spark SQL?

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently the ORC reader in Spark SQL doesn't support vectorized reading. As Hive's 
> ORC reader already supports vectorization, we should add this support to improve ORC 
> reading performance.






[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2017-03-22 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937707#comment-15937707
 ] 

Liang-Chi Hsieh commented on SPARK-17556:
-

We may need to change the Target Version/s for this.


> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
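
For context, a minimal sketch of how a broadcast join is requested today (the table names are hypothetical); the driver-side collect-and-broadcast step described above happens under the hood before the join executes:

{code}
import org.apache.spark.sql.functions.broadcast

val orders = spark.table("orders")        // large side
val customers = spark.table("customers")  // small side, collected on the driver today

val joined = orders.join(broadcast(customers), "customer_id")
{code}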






[jira] [Resolved] (SPARK-19169) columns-changed orc table encounters 'IndexOutOfBoundsException' when reading the old schema files

2017-03-22 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19169.
--
Resolution: Invalid

The reporter seems to be inactive, I can't reproduce this, the JIRA does not 
describe how to reproduce it, and the error message appears to come from ORC 
itself.

I am resolving this JIRA. Please reopen it if I am mistaken.

> columns-changed orc table encounters 'IndexOutOfBoundsException' when reading 
> the old schema files
> -
>
> Key: SPARK-19169
> URL: https://issues.apache.org/jira/browse/SPARK-19169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: roncenzhao
>
> We have an ORC table called orc_test_tbl and have inserted some data into it.
> After that, we changed the table schema by dropping some columns.
> When reading the old schema files, we get the following exception.
> {code}
> java.lang.IndexOutOfBoundsException: toIndex = 65
> at java.util.ArrayList.subListRangeCheck(ArrayList.java:962)
> at java.util.ArrayList.subList(ArrayList.java:954)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.getSchemaOnRead(RecordReaderFactory.java:161)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderFactory.createTreeReader(RecordReaderFactory.java:66)
> at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.(RecordReaderImpl.java:202)
> at 
> org.apache.hadoop.hive.ql.io.orc.ReaderImpl.rowsOptions(ReaderImpl.java:539)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$ReaderPair.(OrcRawRecordMerger.java:183)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger$OriginalReaderPair.(OrcRawRecordMerger.java:226)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcRawRecordMerger.(OrcRawRecordMerger.java:437)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getReader(OrcInputFormat.java:1215)
> at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getRecordReader(OrcInputFormat.java:1113)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:245)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-19136) Aggregator with case class as output type fails with ClassCastException

2017-03-22 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937680#comment-15937680
 ] 

Hyukjin Kwon commented on SPARK-19136:
--

[~a1ray], do you think this JIRA is resolvable?

> Aggregator with case class as output type fails with ClassCastException
> ---
>
> Key: SPARK-19136
> URL: https://issues.apache.org/jira/browse/SPARK-19136
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Mathieu D
>Priority: Minor
>
> {{Aggregator}} with a case class as the output type returns a Row that cannot be 
> cast back to that type; the cast fails with {{ClassCastException}}.
> Here is a dummy example to reproduce the problem:
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
> import org.apache.spark.sql.expressions.Aggregator
> import spark.implicits._
> case class MinMax(min: Int, max: Int)
> case class MinMaxAgg() extends Aggregator[Row, (Int, Int), MinMax] with 
> Serializable {
>   def zero: (Int, Int) = (Int.MaxValue, Int.MinValue)
>   def reduce(b: (Int, Int), a: Row): (Int, Int) = (Math.min(b._1, 
> a.getAs[Int](0)), Math.max(b._2, a.getAs[Int](0)))
>   def finish(r: (Int, Int)): MinMax = MinMax(r._1, r._2)
>   def merge(b1: (Int, Int), b2: (Int, Int)): (Int, Int) = (Math.min(b1._1, 
> b2._1), Math.max(b1._2, b2._2))
>   def bufferEncoder: Encoder[(Int, Int)] = ExpressionEncoder()
>   def outputEncoder: Encoder[MinMax] = ExpressionEncoder()
> }
> val ds = Seq(1, 2, 3, 4).toDF("col1")
> val agg = ds.agg(MinMaxAgg().toColumn.alias("minmax"))
> {code}
> bq. {code}
> ds: org.apache.spark.sql.DataFrame = [col1: int]
> agg: org.apache.spark.sql.DataFrame = [minmax: struct]
> {code}
> {code}agg.printSchema(){code}
> bq. {code}
> root
>  |-- minmax: struct (nullable = true)
>  ||-- min: integer (nullable = false)
>  ||-- max: integer (nullable = false)
> {code}
> {code}agg.head(){code}
> bq. {code}
> res1: org.apache.spark.sql.Row = [[1,4]]
> {code}
> {code}agg.head().getAs[MinMax](0){code}
> bq. {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to line4c81e18af34342cda654c381ee91139525.$read$$iw$$iw$$iw$$iw$MinMax
> [...]
> {code}
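
One possible workaround sketch (not from the ticket): instead of getAs with the case class, read the nested Row back and rebuild the object by hand:

{code}
// agg and MinMax are as defined in the example above.
val row = agg.head().getStruct(0)                  // the aggregate comes back as a nested Row
val minMax = MinMax(row.getInt(0), row.getInt(1))  // MinMax(1, 4) for this data
{code}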






[jira] [Commented] (SPARK-20061) Reading a file with colon (:) from S3 fails with URISyntaxException

2017-03-22 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937675#comment-15937675
 ] 

Genmao Yu commented on SPARK-20061:
---

Colons are not supported in Hadoop paths; see 
[HDFS-13|https://issues.apache.org/jira/browse/HDFS-13].

> Reading a file with colon (:) from S3 fails with URISyntaxException
> ---
>
> Key: SPARK-20061
> URL: https://issues.apache.org/jira/browse/SPARK-20061
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
> Environment: EC2, AWS
>Reporter: Michel Lemay
>
> When reading a bunch of files from s3 using wildcards, it fails with the 
> following exception:
> {code}
> scala> val fn = "s3a://mybucket/path/*/"
> scala> val ds = spark.readStream.schema(schema).json(fn)
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.(Path.java:171)
>   at org.apache.hadoop.fs.Path.(Path.java:93)
>   at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
>   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
>   at 
> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
>   at 
> org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
>   ... 50 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
> 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 73 more
> {code}
> The file in question sits at the root of s3a://mybucket/path/
> {code}
> aws s3 ls s3://mybucket/path/
>PRE subfolder1/
>PRE subfolder2/
> ...
> 2017-01-06 20:33:46   1383 
> 2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
> ...
> {code}
> Removing the wildcard from the path makes it work, but then it obviously misses 
> all files in subdirectories.
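
The underlying failure can be reproduced without S3 (a minimal sketch, using a shortened made-up file name): Hadoop's Path treats the text before the first colon as a URI scheme, which leaves a relative path in an absolute URI:

{code}
import org.apache.hadoop.fs.Path

// Throws IllegalArgumentException caused by
// java.net.URISyntaxException: Relative path in absolute URI
new Path("2017-01-06T20:33:45.255-example.json")
{code}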






[jira] [Commented] (SPARK-20023) Can not see table comment when describe formatted table

2017-03-22 Thread chenerlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937670#comment-15937670
 ] 

chenerlu commented on SPARK-20023:
--

Hi, I reviewed and tested this PR, and found that the table comment cannot be 
changed once it has been specified in "create table ... comment". Users can no 
longer modify it with alter table set tblproperties ("Comment" = "I will change 
the table comment") as they could before. I think we should offer the relevant SQL 
or an interface for users to change the table comment when it was specified by 
mistake. Is this a bug, or do we have a better solution?
[~cloud_fan] [~smilegator] [~ZenWzh]
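
For reference, a sketch of the commands being discussed (the table name is made up); the question is whether the comment set at creation, and later altered, is visible in the DESCRIBE output:

{code}
spark.sql("CREATE TABLE t1 (id INT) COMMENT 'initial table comment'")
spark.sql("ALTER TABLE t1 SET TBLPROPERTIES ('comment' = 'updated table comment')")
spark.sql("DESCRIBE FORMATTED t1").show(100, truncate = false)
{code}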

> Can not see table comment when describe formatted table
> ---
>
> Key: SPARK-20023
> URL: https://issues.apache.org/jira/browse/SPARK-20023
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: chenerlu
>Assignee: Xiao Li
> Fix For: 2.2.0
>
>
> Spark 2.x implements CREATE TABLE by itself:
> https://github.com/apache/spark/commit/7d2ed8cc030f3d84fea47fded072c320c3d87ca7
> But the implementation above removes the table comment from the table 
> properties, so users cannot see the table comment by running "describe formatted 
> table". Similarly, when a user alters the table comment, the change is still not 
> visible through "describe formatted table".
> I wonder why we removed table comments; is this a bug?






[jira] [Assigned] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20066:


Assignee: (was: Apache Spark)

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
> class. And, it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).






[jira] [Commented] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937666#comment-15937666
 ] 

Apache Spark commented on SPARK-20066:
--

User 'markgrover' has created a pull request for this issue:
https://github.com/apache/spark/pull/17393

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
> class. And, it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).






[jira] [Assigned] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20066:


Assignee: Apache Spark

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>Assignee: Apache Spark
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
> class. And, it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:46 AM:
---

[~q79969786] Thanks for the response. Your comment is half right, for these reasons:

- Issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 has still not been 
merged into master.

- SPARK-18086 has been merged into master. However, that issue only fixes the 
bin/spark-sql shell interface (a code change in SparkSQLCLIDriver) and does not 
deal with the bin/beeline interface (no code change in 
SparkSQLOperationManager).

That's why the command bin/beeline -f test.sql --hivevar db_name=online cannot 
work in Spark 2.x.

So the value of SPARK-19927 is to deal with this problem. Please review it again, 
or merging SPARK-13983 would be a workaround.




was (Author: xwc3504):
[~q79969786] Thx for response. your comment is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has been merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem. hope to review again.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:41 AM:
---

[~q79969786] Thx for response. your comment is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has been merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem. hope to review again.




was (Author: xwc3504):
[~q79969786] Thx for response. your comment is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem. hope to review again.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:39 AM:
---

[~q79969786] Thx for response. your comment is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem. hope to review again.




was (Author: xwc3504):
[~q79969786] Thx for response. your comment is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Commented] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937658#comment-15937658
 ] 

Mark Grover commented on SPARK-20066:
-

I have attached some simple test code here: 
https://github.com/markgrover/spark-20066

With the current state of Spark,
mvn clean package -Dspark.version=2.1.0 fails.

mvn clean package -Dspark.version=2.0.0 passes.

> Add explicit SecurityManager(SparkConf) constructor for backwards 
> compatibility with Java
> -
>
> Key: SPARK-20066
> URL: https://issues.apache.org/jira/browse/SPARK-20066
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Mark Grover
>
> SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
> class. And, it has a default value, so life is great.
> However, that's not enough when invoking the class from Java. We didn't see 
> this before because the SecurityManager class is private to the spark package 
> and all the code that uses it is Scala.
> However, I have some code that was extending it, in Java, and that code 
> breaks because Java can't access that default value (more details 
> [here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:37 AM:
---

[~q79969786] Thx for response. your comment is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem.




was (Author: xwc3504):
[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:34 AM:
---

[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927 deal with this problem.




was (Author: xwc3504):
[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work. 

so SPARK-19927 deal with this problem.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:35 AM:
---

[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927's value is to deal with this problem.




was (Author: xwc3504):
[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work 
in spark 2.X. 

so SPARK-19927 deal with this problem.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Comment Edited] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu edited comment on SPARK-19927 at 3/23/17 3:34 AM:
---

[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work. 

so SPARK-19927 deal with this problem.




was (Author: xwc3504):
[~q79969786] Thx for response. it is half right.  reason:

- issue SPARK-19927 derives from SPARK-13983, but SPARK-1398 still not merge 
into master. 

- SPARK-18086 has merged into master. however this issue only resolve 
bin/spark-sql shell interface(code change in SparkSQLCLIDriver)  problem but 
not dealing with bin/beeline interface(without code change in 
SparkSQLOperationManager).

that's why cmd: bin/beeline -f test.sql --hivevar db_name=online can not work. 

so SPARK-19927 deal with this problem.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Updated] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bruce xu updated SPARK-19927:
-
Description: 
suppose the content of file test.sql:
-
!connect jdbc:hive2://localhost:1 test test
USE ${hivevar:db_name};
-
 
when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
db_name=offline 
the output is: 

Error: org.apache.spark.sql.catalyst.parser.ParseException: 
no viable alternative at input ''(line 1, pos 4)

== SQL ==
use 
^^^ (state=,code=0)
-

so the parameter --hivevar can not be read from beeline CLI.


  was:
suppose the content of file test1.sql:
-
USE ${hivevar:db_name};
-
 
when execute command: bin/spark-sql -f /tmp/test.sql  --hivevar db_name=offline
the output is: 

Error: org.apache.spark.sql.catalyst.parser.ParseException: 
no viable alternative at input ''(line 1, pos 4)

== SQL ==
use 
^^^ (state=,code=0)
-

so the parameter --hivevar can not be read from CLI.
the bug still appears with beeline command: bin/beeline  -f /tmp/test2.sql  
--hivevar db_name=offline with test2.sql:

!connect jdbc:hive2://localhost:1 test test
USE ${hivevar:db_name};
--




> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test.sql:
> -
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> -
>  
> when execute beeline command: bin/beeline  -f /tmp/test.sql  --hivevar 
> db_name=offline 
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from beeline CLI.






[jira] [Created] (SPARK-20066) Add explicit SecurityManager(SparkConf) constructor for backwards compatibility with Java

2017-03-22 Thread Mark Grover (JIRA)
Mark Grover created SPARK-20066:
---

 Summary: Add explicit SecurityManager(SparkConf) constructor for 
backwards compatibility with Java
 Key: SPARK-20066
 URL: https://issues.apache.org/jira/browse/SPARK-20066
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1, 2.2.0
Reporter: Mark Grover


SPARK-19520 added an optional argument (ioEncryptionKey) to Security Manager 
class. And, it has a default value, so life is great.

However, that's not enough when invoking the class from Java. We didn't see 
this before because the SecurityManager class is private to the spark package 
and all the code that uses it is Scala.

However, I have some code that was extending it, in Java, and that code breaks 
because Java can't access that default value (more details 
[here|http://stackoverflow.com/questions/13059528/instantiate-a-scala-class-from-java-and-use-the-default-parameters-of-the-const]).
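
A minimal sketch of the kind of change being proposed (not the actual patch): keep the default argument for Scala callers and add an explicit auxiliary constructor so Java callers can write new SecurityManager(conf) without supplying the key:

{code}
import org.apache.spark.SparkConf

// Sketch only; the real class lives in org.apache.spark and has more members.
class SecurityManager(
    sparkConf: SparkConf,
    val ioEncryptionKey: Option[Array[Byte]] = None) {

  // Explicit one-argument constructor that Java code can call directly,
  // since Java cannot use Scala default parameter values.
  def this(sparkConf: SparkConf) = this(sparkConf, None)
}
{code}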






[jira] [Commented] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread bruce xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937644#comment-15937644
 ] 

bruce xu commented on SPARK-19927:
--

[~q79969786] Thanks for the response. It is half right, for these reasons:

- Issue SPARK-19927 derives from SPARK-13983, but SPARK-13983 has still not been 
merged into master.

- SPARK-18086 has been merged into master. However, that issue only fixes the 
bin/spark-sql shell interface (a code change in SparkSQLCLIDriver) and does not 
deal with the bin/beeline interface (no code change in 
SparkSQLOperationManager).

That's why the command bin/beeline -f test.sql --hivevar db_name=online cannot work.

So SPARK-19927 deals with this problem.



> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test1.sql:
> -
> USE ${hivevar:db_name};
> -
>  
> when execute command: bin/spark-sql -f /tmp/test.sql  --hivevar 
> db_name=offline
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> so the parameter --hivevar can not be read from CLI.
> the bug still appears with beeline command: bin/beeline  -f /tmp/test2.sql  
> --hivevar db_name=offline with test2.sql:
> 
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> --






[jira] [Created] (SPARK-20065) Empty output files created for aggregation query in append mode

2017-03-22 Thread Silvio Fiorito (JIRA)
Silvio Fiorito created SPARK-20065:
--

 Summary: Empty output files created for aggregation query in 
append mode
 Key: SPARK-20065
 URL: https://issues.apache.org/jira/browse/SPARK-20065
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Silvio Fiorito


I've got a Kafka topic which I'm querying, running a windowed aggregation, with 
a 30 second watermark, 10 second trigger, writing out to Parquet with append 
output mode.

Every 10 second trigger generates a file, regardless of whether there was any 
data for that trigger, or whether any records were actually finalized by the 
watermark.

Is this expected behavior or should it not write out these empty files?

{code}
import org.apache.spark.sql.functions.{date_format, window}
import org.apache.spark.sql.streaming.ProcessingTime
import spark.implicits._

// Kafka connection options were omitted in the original snippet; placeholders only.
val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()

val query = df
  .withWatermark("timestamp", "30 seconds")
  .groupBy(window($"timestamp", "10 seconds"))
  .count()
  .select(date_format($"window.start", "HH:mm:ss").as("time"), $"count")

// aggChk and aggPath are checkpoint/output directory paths defined elsewhere.
query
  .writeStream
  .format("parquet")
  .option("checkpointLocation", aggChk)
  .trigger(ProcessingTime("10 seconds"))
  .outputMode("append")
  .start(aggPath)
{code}

As the query executes, do a file listing on "aggPath" and you'll see 339 byte 
files at a minimum until we arrive at the first watermark and the initial batch 
is finalized. Even after that though, as there are empty batches it'll keep 
generating empty files every trigger.






[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937467#comment-15937467
 ] 

Takeshi Yamamuro commented on SPARK-20009:
--

okay, I'll do it later. Thanks!

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we add a new API in the 
> DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in existing some APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.
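
As a rough illustration of the goal (assuming a DataFrame df with a string column named json; the DDL-string call shown in the comment is the proposed user-facing form, not necessarily an existing API at the time of this ticket):

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._

// Today: the schema must be built programmatically.
val schema = new StructType().add("a", IntegerType).add("b", StringType)
val parsed = df.select(from_json($"json", schema))

// Proposed: let user-facing APIs accept a DDL string instead, e.g. (hypothetical):
//   df.select(from_json($"json", "a INT, b STRING"))
{code}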






[jira] [Created] (SPARK-20064) Bump the PySpark verison number to 2.2

2017-03-22 Thread holdenk (JIRA)
holdenk created SPARK-20064:
---

 Summary: Bump the PySpark verison number to 2.2
 Key: SPARK-20064
 URL: https://issues.apache.org/jira/browse/SPARK-20064
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.2.0
Reporter: holdenk
Priority: Minor


The version.py file should be updated for the new version. Note: this isn't critical, 
since for any release made with make-distribution the version number is read 
from the XML, but if anyone builds from source and manually checks the 
version number it would be good to have it match. This is a good starter issue, but 
something we should do quickly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18970.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

I'm going to close this, but please reopen if you can reproduce on 2.1.1+.

> FileSource failure during file list refresh doesn't cause an application to 
> fail, but stops further processing
> --
>
> Key: SPARK-18970
> URL: https://issues.apache.org/jira/browse/SPARK-18970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.0, 2.0.2
>Reporter: Lev
> Fix For: 2.1.0
>
> Attachments: sparkerror.log
>
>
> A Spark Streaming application uses S3 files as streaming sources. After running 
> for several days, processing stopped even though the application continued to 
> run. 
> Stack trace:
> {code}
> java.io.FileNotFoundException: No such file or directory 
> 's3n://X'
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465)
>   at 
> org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I believe two things should (or can) be fixed:
> 1. The application should fail in case of such an error.
> 2. Allow the application to ignore such a failure, since there is a chance that 
> during the next refresh the error will not resurface. (In my case I believe the 
> error was caused by S3 cleaning the bucket at exactly the moment the 
> refresh was running.) 
> My code that creates the streaming processing looks like the following:
> {code}
>   val cq = sqlContext.readStream
> .format("json")
> .schema(struct)
> .load(s"input")
> .writeStream
> .option("checkpointLocation", s"checkpoints")
> .foreach(new ForeachWriter[Row] {...})
> .trigger(ProcessingTime("10 seconds")).start()
>   
> cq.awaitTermination() 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17344) Kafka 0.8 support for Structured Streaming

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-17344.

Resolution: Won't Fix

Unless someone really wants to work on this, I think the fact that they have 
compatibility for 0.10.0+ is reason enough to close this JIRA.

> Kafka 0.8 support for Structured Streaming
> --
>
> Key: SPARK-17344
> URL: https://issues.apache.org/jira/browse/SPARK-17344
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Frederick Reiss
>
> Design and implement Kafka 0.8-based sources and sinks for Structured 
> Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19965) DataFrame batch reader may fail to infer partitions when reading FileStreamSink's output

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19965:
-
Target Version/s: 2.2.0

> DataFrame batch reader may fail to infer partitions when reading 
> FileStreamSink's output
> 
>
> Key: SPARK-19965
> URL: https://issues.apache.org/jira/browse/SPARK-19965
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>
> Reproducer
> {code}
>   test("partitioned writing and batch reading with 'basePath'") {
> val inputData = MemoryStream[Int]
> val ds = inputData.toDS()
> val outputDir = Utils.createTempDir(namePrefix = 
> "stream.output").getCanonicalPath
> val checkpointDir = Utils.createTempDir(namePrefix = 
> "stream.checkpoint").getCanonicalPath
> var query: StreamingQuery = null
> try {
>   query =
> ds.map(i => (i, i * 1000))
>   .toDF("id", "value")
>   .writeStream
>   .partitionBy("id")
>   .option("checkpointLocation", checkpointDir)
>   .format("parquet")
>   .start(outputDir)
>   inputData.addData(1, 2, 3)
>   failAfter(streamingTimeout) {
> query.processAllAvailable()
>   }
>   spark.read.option("basePath", outputDir).parquet(outputDir + 
> "/*").show()
> } finally {
>   if (query != null) {
> query.stop()
>   }
> }
>   }
> {code}
> Stack trace
> {code}
> [info] - partitioned writing and batch reading with 'basePath' *** FAILED *** 
> (3 seconds, 928 milliseconds)
> [info]   java.lang.AssertionError: assertion failed: Conflicting directory 
> structures detected. Suspicious paths:
> [info]***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637
> [info]
> ***/stream.output-65e3fa45-595a-4d29-b3df-4c001e321637/_spark_metadata
> [info] 
> [info] If provided paths are partition directories, please set "basePath" in 
> the options of the data source to specify the root directory of the table. If 
> there are multiple root directories, please load them separately and then 
> union them.
> [info]   at scala.Predef$.assert(Predef.scala:170)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:156)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:54)
> [info]   at 
> org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:536)
> [info]   at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:520)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply$mcV$sp(FileStreamSinkSuite.scala:292)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> [info]   at 
> org.apache.spark.sql.streaming.FileStreamSinkSuite$$anonfun$8.apply(FileStreamSinkSuite.scala:268)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19767:
-
Component/s: (was: Structured Streaming)
 DStreams

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> lead to API pages that do not have class ConsumerStrategies) .  The API doc 
> package names  also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19013) java.util.ConcurrentModificationException when using s3 path as checkpointLocation

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-19013.
--
Resolution: Later

It seems like [HADOOP-13345] is the right solution here, but since this is 
outside of the scope of things we can fix in Spark, I'm going to close this 
ticket to keep the backlog clear.

> java.util.ConcurrentModificationException when using s3 path as 
> checkpointLocation 
> ---
>
> Key: SPARK-19013
> URL: https://issues.apache.org/jira/browse/SPARK-19013
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tim Chan
>
> I have a structured stream job running on EMR. The job will fail due to this
> {code}
> Multiple HDFSMetadataLog are using s3://mybucket/myapp 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLog.org$apache$spark$sql$execution$streaming$HDFSMetadataLog$$writeBatch(HDFSMetadataLog.scala:162)
> {code}
> There is only one instance of this stream job running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20008:


Assignee: Xiao Li  (was: Apache Spark)

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>Assignee: Xiao Li
>Priority: Minor
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1, whereas 0 is expected.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case, not 
> sure.
> Workaround: for now I check that the counts are != 0 before this operation. Not good 
> for performance; hence creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kind of operation are implemented in left anti 
> join, instead of using RDD operation directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug. But my guess is that it is something like the logic of 
> comparing NULL = NULL (should it return true or false?) that causes this kind of confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20008:


Assignee: Apache Spark  (was: Xiao Li)

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>Assignee: Apache Spark
>Priority: Minor
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1, whereas 0 is expected.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case, not 
> sure.
> Workaround: for now I check that the counts are != 0 before this operation. Not good 
> for performance; hence creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kind of operation are implemented in left anti 
> join, instead of using RDD operation directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug. But my guess is that it is something like the logic of 
> comparing NULL = NULL (should it return true or false?) that causes this kind of confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937430#comment-15937430
 ] 

Apache Spark commented on SPARK-20008:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17392

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>Assignee: Xiao Li
>Priority: Minor
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1, whereas 0 is expected.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case, not 
> sure.
> Workaround: for now I check that the counts are != 0 before this operation. Not good 
> for performance; hence creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kind of operation are implemented in left anti 
> join, instead of using RDD operation directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug. But my guess is that it is something like the logic of 
> comparing NULL = NULL (should it return true or false?) that causes this kind of confusion. 
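
A minimal sketch of the workaround mentioned in the quoted description (guard the 
except call when either side is empty); this is illustrative only and assumes Spark 2.x 
DataFrames:

{code}
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("except-workaround").getOrCreate()

// Guarded version of except: skip the anti-join entirely when either side is empty.
def safeExcept(left: DataFrame, right: DataFrame): DataFrame =
  if (left.head(1).isEmpty) left        // nothing to subtract from
  else if (right.head(1).isEmpty) left  // nothing to subtract
  else left.except(right)

val result = safeExcept(spark.emptyDataFrame, spark.emptyDataFrame)
println(result.count())  // 0, as expected
{code}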



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19788) DataStreamReader/DataStreamWriter.option shall accept user-defined type

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-19788.
--
Resolution: Won't Fix

Thanks for the suggestion.  However, as [~zsxwing] said, the goal here is a 
small, cross-language compatible API that is the same as the batch version.  I 
think it is totally reasonable for a specific source to produce typesafe bindings on 
top of this API (look at spark-avro for an example).
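
As a rough illustration of the "typesafe bindings on top of the string options" idea, 
a hedged sketch; all names here are made up for the sketch, not an existing source:

{code}
import org.apache.spark.sql.streaming.DataStreamReader

// Hypothetical typed configuration for some custom source.
case class MySourceConfig(endpoints: Map[String, String], maxRate: Long)

// An enrichment that flattens the typed config into the existing string-based options.
implicit class MySourceOptions(val reader: DataStreamReader) extends AnyVal {
  def mySource(conf: MySourceConfig): DataStreamReader =
    reader
      .format("my-source")
      .option("maxRate", conf.maxRate)
      .options(conf.endpoints.map { case (k, v) => s"endpoint.$k" -> v })
}
{code}

Callers would then write spark.readStream.mySource(conf).load() while the 
cross-language surface stays string-based.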

> DataStreamReader/DataStreamWriter.option shall accept user-defined type
> ---
>
> Key: SPARK-19788
> URL: https://issues.apache.org/jira/browse/SPARK-19788
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nan Zhu
>
> There are many other data sources/sinks which have very different 
> configuration styles than Kafka, FileSystem, etc. 
> The expected type of the configuration entry passed to them might be a nested 
> collection type, e.g. Map[String, Map[String, String]], or even a 
> user-defined type (for example, the one I am working on).
> Right now, option can only accept String -> String/Boolean/Long/Double OR a 
> complete Map[String, String]... my suggestion is that we accept 
> Map[String, Any], and that the type of 'parameters' in SourceProvider.createSource 
> also be Map[String, Any]; this would give much more flexibility to the 
> user.
> The drawback is that it is a breaking change (we can mitigate this by 
> deprecating the current one and progressively evolving to the new one if the 
> proposal is accepted).
> [~zsxwing] what do you think?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19932) Disallow a case that might cause OOM for steaming deduplication

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-19932.
--
Resolution: Won't Fix

Thanks for working on this.  While I think it would be helpful to come up with 
a full proposal to help users understand which of their queries might result in 
unscalable amounts of state, I don't think we should do it piecemeal in this 
way.

> Disallow a case that might cause OOM for steaming deduplication
> ---
>
> Key: SPARK-19932
> URL: https://issues.apache.org/jira/browse/SPARK-19932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Liwei Lin
>
> {code}
> spark
>.readStream // schema: (word, eventTime), like ("a", 10), 
> ("a", 11), ("b", 12) ...
>...
>.withWatermark("eventTime", "10 seconds")
>.dropDuplicates("word") // note: "eventTime" is not part of the key 
> columns
>...
> {code}
> As shown above, right now if watermark is specified for a streaming 
> dropDuplicates query, but not specified as the key columns, then we'll still 
> get the correct answer, but the state just keeps growing and will never get 
> cleaned up.
> The reason is, the watermark attribute is not part of the key of the state 
> store in this case. We're not saving event time information in the state 
> store.
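
For contrast, a hedged sketch of the variant where state can be cleaned up, i.e. with 
the watermarked event-time column included in the dropDuplicates key; the built-in 
rate source (available in newer Spark versions) is used only to make the stream runnable:

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("dedup-with-eventtime").getOrCreate()

// The rate source produces (timestamp, value); derive a small key column from it.
val streamingDf = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
  .selectExpr("CAST(value % 10 AS STRING) AS word", "timestamp AS eventTime")

// With eventTime in the dropDuplicates key, rows older than the watermark can be evicted.
val deduped = streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("word", "eventTime")
{code}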



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19876) Add OneTime trigger executor

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-19876:


Assignee: Tyson Condie

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Tyson Condie
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19876) Add OneTime trigger executor

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19876:
-
Target Version/s: 2.2.0

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19989:
-
Description: 
This test failed recently here: 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/

And based on Josh's dashboard 
(https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite_name=stress+test+with+multiple+topics+and+partitions),
 seems to fail a few times every month.  Here's the full error from the most 
recent failure:

Error Message
{code}
org.scalatest.exceptions.TestFailedException:  Error adding data: replication 
factor: 1 larger than available brokers: 0 
kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)  
kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)  
org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
  
org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
  
org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
  
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
  
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
  scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)  
org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
  
org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
{code}

{code}
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
Error adding data: replication factor: 1 larger than available brokers: 0
kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)

org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)

org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)

org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)

org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)

org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)

org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)

org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)


== Progress ==
   AssertOnQuery(, )
   CheckAnswer: 
   StopStream
   
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map())
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data 
= Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )
   CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]
   StopStream
   
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map())
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data 
= Range(9, 10, 11, 12, 13, 14), message = )
   CheckAnswer: 
[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]
   StopStream
   AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data 
= Range(), message = )
=> AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
stress3), data = Range(15), message = Add topic stress7)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
stress3), data = Range(23, 24), message = Add partition)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
stress5, stress3), data = Range(), message = Add topic stress9)
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
stress5, stress3), data = Range(25, 26, 27, 28, 29, 30, 31, 32, 33), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
stress5, stress3), data = Range(), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
stress5, stress3), data = Range(), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
stress5, stress3), data = Range(34, 35, 36, 37, 38, 39), message = )
   AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, 

[jira] [Updated] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite

2017-03-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19989:
-
Target Version/s: 2.2.0

> Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
> 
>
> Key: SPARK-19989
> URL: https://issues.apache.org/jira/browse/SPARK-19989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>Priority: Minor
>  Labels: flaky-test
>
> This test failed recently here: 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/
> And based on Josh's dashboard 
> (https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite_name=stress+test+with+multiple+topics+and+partitions),
>  seems to fail a few times every month.  Here's the full error from the most 
> recent failure:
> Error Message
> {code}
> org.scalatest.exceptions.TestFailedException:  Error adding data: replication 
> factor: 1 larger than available brokers: 0 
> kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)  
> kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)  
> org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
>   scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)  
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
> {code}
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> Error adding data: replication factor: 1 larger than available brokers: 0
> kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117)
>   kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403)
>   
> org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903)
>   
> org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92)
>   scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316)
>   
> org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92)
>   
> org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494)
> == Progress ==
>AssertOnQuery(, )
>CheckAnswer: 
>StopStream
>
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map())
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )
>CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]
>StopStream
>
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map())
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(9, 10, 11, 12, 13, 14), message = )
>CheckAnswer: 
> [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]
>StopStream
>AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), 
> data = Range(), message = )
> => AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(15), message = Add topic stress7)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, 
> stress3), data = Range(23, 24), message = Add partition)
>AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, 
> stress5, stress3), data = Range(), message = Add 

[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-22 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937377#comment-15937377
 ] 

Marcelo Vanzin commented on SPARK-18085:


Hi all, an update. I started working on this again and was concerned about my 
original approach; fixing conflicts would be hard and it would be easy to miss 
changes in the UI while the work is going on. So I changed my approach a bit to 
make the integration more gradual.

Basically, instead of forking the UI code, I'm moving the old code into the new 
module and making changes to it so the tests work. So while this work is going on, 
there's a "franken ui" where part of the UI is using the old backend and part 
of the UI is using the new backend. This will save time in the end, and will 
catch errors in the new code sooner (as I found out when trying to fix all the 
unit tests).

So I updated all the branches with the new approach:
https://github.com/vanzin/spark/tree/shs-ng/M4.4
https://github.com/vanzin/spark/tree/shs-ng/M4.3
https://github.com/vanzin/spark/tree/shs-ng/M4.2
https://github.com/vanzin/spark/tree/shs-ng/M4.1
https://github.com/vanzin/spark/tree/shs-ng/M4.0
https://github.com/vanzin/spark/tree/shs-ng/M3
https://github.com/vanzin/spark/tree/shs-ng/M2
https://github.com/vanzin/spark/tree/shs-ng/M1

There's a new branch (M4.0) where I do some initial integration to allow the 
"franken ui" to be used. And each M4 branch contains more changes than before 
because I had to make all existing unit tests pass after the data was served 
from the backend. Current state is that all "core" pages are using the new 
backend and pass unit tests.

Next I'll be re-basing everything to current master, and then I'll start to add 
more things and clean up more old code that's not needed anymore.


> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20063) Trigger without delay when falling behind

2017-03-22 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-20063:


 Summary: Trigger without delay when falling behind 
 Key: SPARK-20063
 URL: https://issues.apache.org/jira/browse/SPARK-20063
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Michael Armbrust
Priority: Critical


Today, when we miss a trigger interval we wait until the next one to fire.  
However, for real workloads this usually means that you fall further and 
further behind by sitting idle while waiting.  We should revisit this decision.
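
A conceptual sketch of the alternative being suggested; this is not Spark's actual 
trigger executor code, just an illustration of the scheduling policy:

{code}
// Decide when the next batch should start. If the previous batch overran the
// interval, start immediately instead of idling until the next aligned boundary.
def nextTriggerTimeMs(intervalMs: Long, lastBatchStartMs: Long, nowMs: Long): Long = {
  val scheduled = lastBatchStartMs + intervalMs
  if (nowMs >= scheduled) nowMs   // we are behind: trigger without delay
  else scheduled                  // on time: keep the regular cadence
}
{code}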



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20062) Inconsistent checking on ML estimator/model copy in the unit tests.

2017-03-22 Thread yuhao yang (JIRA)
yuhao yang created SPARK-20062:
--

 Summary: Inconsistent checking on ML estimator/model copy in the 
unit tests.
 Key: SPARK-20062
 URL: https://issues.apache.org/jira/browse/SPARK-20062
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 2.1.0
Reporter: yuhao yang
Priority: Minor


Currently {code} 
// copied model must have the same parent.
MLTestingUtils.checkCopy(model)
 {code}
is missing from many unit tests in ml (only 6 appearances in ml.feature). And 
even for the suites that do have the check, we don't have a consistent place for it; 
it is scattered among different unit tests. Perhaps that's the reason the 
check is often forgotten for new features.

Possible options (a sketch of option 1 follows below):
1. put it together with save/load
2. put it in the default parameter check (but not all algorithms have this 
check).
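
A rough sketch of what option 1 could look like, i.e. one shared helper that new test 
suites call once; the helper name and its shape are assumptions, not existing test code:

{code}
import org.apache.spark.ml.Model

// Hypothetical shared helper (names are illustrative): run the copy check in the
// same place as the save/load round-trip so new suites get both from one call.
def checkCopyWithPersistence(model: Model[_]): Unit = {
  // copied model must have the same parent (the check discussed above)
  MLTestingUtils.checkCopy(model)
  // ... the existing save/load round-trip for the same model would follow here
}
{code}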






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2017-03-22 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937356#comment-15937356
 ] 

Alex Bozarth commented on SPARK-20044:
--

I took a look at your link and it looks like it's on the right path, but it seems 
you still need to go through and make sure it completely solves the problem. If 
you open up a PR I'll take a more detailed look. This addition actually solves 
an issue I had with the original PR: I was always worried that 
`spark.ui.reverseProxyUrl` was only used in one place and didn't have to be 
correct in many use cases; this addition seems to leverage it across the UI to 
solve your issue.

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Oliver Koeth
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow to run the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows to access multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define a 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaption are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20037) impossible to set kafka offsets using kafka 0.10 and spark 2.0.0

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20037:
--
Target Version/s:   (was: 2.0.0)
Priority: Major  (was: Critical)
   Fix Version/s: (was: 2.0.3)

> impossible to set kafka offsets using kafka 0.10 and spark 2.0.0
> 
>
> Key: SPARK-20037
> URL: https://issues.apache.org/jira/browse/SPARK-20037
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
>
> I use kafka 0.10.1 and java code with the following dependencies:
> {code:xml}
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> {code}
> The code tries to read a topic starting from specific offsets. 
> The topic has 4 partitions that start somewhere before 585000 and end after 
> 674000. So I wanted to read all partitions starting with 585000
> fromOffsets.put(new TopicPartition(topic, 0), 585000L);
> fromOffsets.put(new TopicPartition(topic, 1), 585000L);
> fromOffsets.put(new TopicPartition(topic, 2), 585000L);
> fromOffsets.put(new TopicPartition(topic, 3), 585000L);
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> The code immediately throws:
> Beginning offset 585000 is after the ending offset 584464 for topic 
> commerce_item_expectation partition 1
> This does not make sense, because the topic/partition starts at 584464; it does not end there.
> I use this as a base: 
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> But I use direct stream:
> KafkaUtils.createDirectStream(jssc,LocationStrategies.PreferConsistent(),
> ConsumerStrategies.Subscribe(
> topics, kafkaParams, fromOffsets
> )
> )
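
One way to sanity-check the reported boundaries before building fromOffsets is to ask 
the brokers directly; a hedged sketch using the kafka-clients 0.10.1 consumer API 
(bootstrap servers are a placeholder, the topic name is the one from the error above):

{code}
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val partitions = (0 to 3).map(new TopicPartition("commerce_item_expectation", _)).asJava

// beginningOffsets/endOffsets were added in kafka-clients 0.10.1.
val begin = consumer.beginningOffsets(partitions).asScala
val end   = consumer.endOffsets(partitions).asScala
begin.foreach { case (tp, off) => println(s"$tp begins at $off, ends at ${end(tp)}") }
consumer.close()
{code}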



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19876) Add OneTime trigger executor

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19876:
--
Fix Version/s: (was: 2.2.0)

> Add OneTime trigger executor
> 
>
> Key: SPARK-19876
> URL: https://issues.apache.org/jira/browse/SPARK-19876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>
> The goal is to add a new trigger executor that will process a single trigger 
> then stop. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20036) impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20036:
--
Target Version/s:   (was: 2.0.0)
Priority: Major  (was: Critical)
   Fix Version/s: (was: 2.0.3)

Please read http://spark.apache.org/contributing.html before opening a JIRA. I 
don't think this is nearly enough info to reproduce or understand the problem 
you're reporting.

> impossible to read a whole kafka topic using kafka 0.10 and spark 2.0.0 
> 
>
> Key: SPARK-20036
> URL: https://issues.apache.org/jira/browse/SPARK-20036
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: Daniel Nuriyev
>
> I use kafka 0.10.1 and java code with the following dependencies:
> {code:xml}
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka_2.11</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.kafka</groupId>
>   <artifactId>kafka-clients</artifactId>
>   <version>0.10.1.1</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>   <version>2.0.0</version>
> </dependency>
> {code}
> The code tries to read the whole topic using:
> kafkaParams.put("auto.offset.reset", "earliest");
> Using 5 second batches:
> jssc = new JavaStreamingContext(conf, Durations.seconds(5));
> Each batch returns empty.
> While debugging the code, I noticed that KafkaUtils.fixKafkaParams is called and 
> overrides "earliest" with "none".
> Whether this is related or not, when I used kafka 0.8 on the client with 
> kafka 0.10.1 on the server, I could read the whole topic.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accept

2017-03-22 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937269#comment-15937269
 ] 

yuhao yang edited comment on SPARK-20043 at 3/22/17 10:25 PM:
--

Looks like a bug in tree model loading: a toLowerCase should be applied when reading 
the impurityType from the metadata. 
Ideally, we should also check for potential issues like this in other 
algorithms.
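
A minimal sketch of the normalization being suggested (names are illustrative, not the 
actual Spark ML loader code):

{code}
// Lower-case the impurity name read back from the persisted metadata before
// matching it, so "Gini"/"Entropy" load just as well as "gini"/"entropy".
def normalizeImpurity(fromMetadata: String): String = fromMetadata.toLowerCase match {
  case ok @ ("gini" | "entropy" | "variance") => ok
  case other => throw new IllegalArgumentException(s"impurity $other not recognized")
}
{code}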


was (Author: yuhaoyan):
Looks like a bug for tree models load. a toLower should be added when loading 
impurityType from metadata. 

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test the "gini" and "entropy" impurities. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase; I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-22 Thread Nick Pentreath (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Pentreath updated SPARK-20043:
---
Labels: starter  (was: )

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>  Labels: starter
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test the "gini" and "entropy" impurities. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase; I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted

2017-03-22 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937269#comment-15937269
 ] 

yuhao yang commented on SPARK-20043:


Looks like a bug for tree models load. a toLower should be added when loading 
impurityType from metadata. 

> CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" 
> on ML random forest and decision. Only "gini" and "entropy" (in lower case) 
> are accepted
> 
>
> Key: SPARK-20043
> URL: https://issues.apache.org/jira/browse/SPARK-20043
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Zied Sellami
>
> I saved a CrossValidatorModel with a decision tree and a random forest. I use 
> ParamGrid to test the "gini" and "entropy" impurities. CrossValidatorModel is not 
> able to load the saved model when the impurity is not written in lowercase; I 
> obtain an error from Spark: "impurity Gini (Entropy) not recognized".



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability

2017-03-22 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19613.

Resolution: Cannot Reproduce

I'm closing this because, while it had a burst of failures about a month ago 
(see here: 
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite=versioning+and+immutability)
 it hasn't failed since.

> Flaky test: StateStoreRDDSuite.versioning and immutability
> --
>
> Key: SPARK-19613
> URL: https://issues.apache.org/jira/browse/SPARK-19613
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> This test: 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning 
> and immutability failed on a recent PR: 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19612) Tests failing with timeout

2017-03-22 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-19612.

Resolution: Cannot Reproduce

Closing this for now because I haven't seen this issue in a while (we can 
re-open if this starts occurring again)

> Tests failing with timeout
> --
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.1
>Reporter: Kay Ousterhout
>Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the 
> timeout.
> cc [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13747) Concurrent execution in SQL doesn't work with Scala ForkJoinPool

2017-03-22 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936997#comment-15936997
 ] 

Barry Becker commented on SPARK-13747:
--

We have hit this on rare occasions in our production environment when calling 
tableNames on SQLContext. We are using Spark 2.1.0. Are there any possible 
workarounds that we might try? What is the ETA for Spark 2.2?

{code}
spark.sql.execution.id is already set
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:81)
org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2375)
org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2375)
org.apache.spark.sql.Dataset.collect(Dataset.scala:2351)
org.apache.spark.sql.SQLContext.tableNames(SQLContext.scala:750)
{code}
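
One possible mitigation to try, offered only as a hedged sketch: drive the concurrent 
SQL actions from a plain fixed-size thread pool instead of Scala's global ForkJoinPool, 
so a blocked runJob call is not interleaved with another task on the same thread 
(sqlContext is assumed to be in scope):

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val pool = Executors.newFixedThreadPool(8)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

// Run the lookups concurrently, but not on the work-stealing ForkJoinPool.
val futures = (1 to 100).map { _ =>
  Future { sqlContext.tableNames().length }
}
val counts = Await.result(Future.sequence(futures), 5.minutes)
pool.shutdown()
{code}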

> Concurrent execution in SQL doesn't work with Scala ForkJoinPool
> 
>
> Key: SPARK-13747
> URL: https://issues.apache.org/jira/browse/SPARK-13747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.2.0
>
>
> Running the following code may fail
> {code}
> (1 to 100).par.foreach { _ =>
>   println(sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count())
> }
> java.lang.IllegalArgumentException: spark.sql.execution.id is already set 
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>  
> at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904) 
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385) 
> {code}
> This is because SparkContext.runJob can be suspended when using a 
> ForkJoinPool (e.g., scala.concurrent.ExecutionContext.Implicits.global), as it 
> calls Await.ready (introduced by https://github.com/apache/spark/pull/9264).
> So when SparkContext.runJob is suspended, the ForkJoinPool will run another 
> task in the same thread; however, by then the thread-local properties have 
> been polluted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20057) Renamed KeyedState to GroupState

2017-03-22 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-20057.
---
Resolution: Fixed

Issue resolved by pull request 17385
[https://github.com/apache/spark/pull/17385]

> Renamed KeyedState to GroupState
> 
>
> Key: SPARK-20057
> URL: https://issues.apache.org/jira/browse/SPARK-20057
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.2.0
>
>
> Since the state is tied to a "group" in the "mapGroupsWithState" operations, 
> it's better to call the state "GroupState" instead of a key. This would also 
> make it more general if we extend this operation to RelationGroupedDataset and 
> the Python APIs.
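
For illustration, a minimal sketch of the renamed handle in use (assuming the 
Spark 2.2 {{mapGroupsWithState}} signature; the event type and counting logic 
are made up for the example):

{code}
import org.apache.spark.sql.streaming.GroupState

// Count events per key; GroupState (formerly KeyedState) carries the running total.
def countEvents(key: String, events: Iterator[String], state: GroupState[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + events.size
  state.update(newCount)
  (key, newCount)
}

// Usage, given a Dataset[String] named events:
//   events.groupByKey(identity).mapGroupsWithState(countEvents _)
{code}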



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-22 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936905#comment-15936905
 ] 

Joseph K. Bradley commented on SPARK-20040:
---

Sure, go ahead, thanks!

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-22 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936902#comment-15936902
 ] 

Bago Amirbekian commented on SPARK-20040:
-

I'd like to work on this.

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20009) Use user-friendly DDL formats for defining a schema in user-facing APIs

2017-03-22 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936876#comment-15936876
 ] 

Michael Armbrust commented on SPARK-20009:
--

Yeah, the DDL format is certainly a lot easier to type than the JSON.  I think 
it makes sense to support both if we can tell the difference unambiguously 
(which I think we can).
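
To make the comparison concrete, here is a rough sketch of the two notations 
for the same schema (the DDL-parsing entry point shown is an assumption about 
the API this work targets, not something guaranteed to exist in 2.1):

{code}
import org.apache.spark.sql.types._

// JSON form of a two-column schema (what DataType.fromJson already accepts):
val jsonSchema = DataType.fromJson(
  """{"type":"struct","fields":[
    |{"name":"a","type":"integer","nullable":true,"metadata":{}},
    |{"name":"b","type":"string","nullable":true,"metadata":{}}]}""".stripMargin
).asInstanceOf[StructType]

// User-friendly DDL form of the same schema (helper name assumed):
val ddlSchema = StructType.fromDDL("a INT, b STRING")
{code}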

> Use user-friendly DDL formats for defining a schema  in user-facing APIs
> 
>
> Key: SPARK-20009
> URL: https://issues.apache.org/jira/browse/SPARK-20009
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>
> In https://issues.apache.org/jira/browse/SPARK-19830, we added a new API in 
> the DDL parser to convert a DDL string into a schema. Then, we can use DDL 
> formats in some existing APIs, e.g., functions.from_json 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3062.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20008:

Priority: Minor  (was: Major)

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>Assignee: Xiao Li
>Priority: Minor
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 against expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case; I'm 
> not sure.
> Workaround: for now I check that the counts are != 0 before this operation, 
> which is not good for performance. Hence I'm creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kinds of operations are implemented as a left 
> anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug, but my guess is that the logic of comparing 
> NULL = NULL (should it return true or false?) is causing this kind of 
> confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936847#comment-15936847
 ] 

Xiao Li commented on SPARK-20008:
-

This sounds like a general issue in Spark SQL. For example, 
{{spark.emptyDataFrame.distinct()}} also returns a non-empty result set. 
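
A minimal spark-shell sketch of the two symptoms (expected results are taken 
from the reports above, not re-verified here):

{code}
val empty = spark.emptyDataFrame
empty.except(empty).count()   // reported to return 1, expected 0
empty.distinct().count()      // also reported to return a non-empty result
{code}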

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>Assignee: Xiao Li
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 against expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case; I'm 
> not sure.
> Workaround: for now I check that the counts are != 0 before this operation, 
> which is not good for performance. Hence I'm creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kinds of operations are implemented as a left 
> anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug, but my guess is that the logic of comparing 
> NULL = NULL (should it return true or false?) is causing this kind of 
> confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20054) [Mesos] Detectability for resource starvation

2017-03-22 Thread Kamal Gurala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936759#comment-15936759
 ] 

Kamal Gurala commented on SPARK-20054:
--

Yes, the logs do help detect the issue. 
Do you think a new config option that gives resources back to the cluster if 
`spark.scheduler.minRegisteredResourcesRatio` is not met after a configurable 
amount of time would be of interest?
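
For reference, a sketch of the two existing knobs around this behaviour (both 
are real Spark settings; the time-bounded "give resources back" option 
suggested above would be a new one):

{code}
val conf = new org.apache.spark.SparkConf()
  // fraction of requested resources that must register before scheduling starts
  .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
  // how long the scheduler waits for that ratio before starting anyway
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "30s")
{code}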

> [Mesos] Detectability for resource starvation
> -
>
> Key: SPARK-20054
> URL: https://issues.apache.org/jira/browse/SPARK-20054
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos, Scheduler
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
>Reporter: Kamal Gurala
>Priority: Minor
>
> We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We 
> had a production issue recently wherein our Spark frameworks accepted 
> resources from the Mesos master, so executors were started and the Spark 
> driver was aware of them, but the driver didn't schedule any tasks and nothing 
> happened for a long time because the minimum registered resources threshold 
> was not met. The cluster is usually under-provisioned because not all the jobs 
> need to run at the same time. These held resources were never offered back to 
> the master for re-allocation, bringing the entire cluster to a halt until we 
> had to intervene manually. 
> We use DRF for Mesos and FIFO for Spark, and at any point in time there could 
> be 10-15 Spark frameworks running on the under-provisioned Mesos cluster. 
> The ask is to have better recoverability or detectability for a scenario where 
> individual Spark frameworks hold onto resources but never launch any tasks, or 
> to have these frameworks release those resources after a fixed amount of time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17204) Spark 2.0 off heap RDD persistence with replication factor 2 leads to in-memory data corruption

2017-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936733#comment-15936733
 ] 

Apache Spark commented on SPARK-17204:
--

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/17390

> Spark 2.0 off heap RDD persistence with replication factor 2 leads to 
> in-memory data corruption
> ---
>
> Key: SPARK-17204
> URL: https://issues.apache.org/jira/browse/SPARK-17204
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Michael Allman
>Assignee: Michael Allman
> Fix For: 2.1.1, 2.2.0
>
>
> We use the {{OFF_HEAP}} storage level extensively with great success. We've 
> tried off-heap storage with replication factor 2 and have always received 
> exceptions on the executor side very shortly after starting the job. For 
> example:
> {code}
> com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 
> 9086
>   at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
>   at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> or
> {code}
> java.lang.IndexOutOfBoundsException: Index: 6, Size: 0
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:60)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:834)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:788)
>   at 
> org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator.hasNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> 

[jira] [Resolved] (SPARK-20018) Pivot with timestamp and count should not print internal representation

2017-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20018.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 2.2.0

> Pivot with timestamp and count should not print internal representation
> ---
>
> Key: SPARK-20018
> URL: https://issues.apache.org/jira/browse/SPARK-20018
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, when we perform count with timestamp types, it prints the internal 
> representation as the column name as below:
> {code}
> scala> Seq(new 
> java.sql.Timestamp(1)).toDF("a").groupBy("a").pivot("a").count().show()
> +++
> |   a|1000|
> +++
> |1969-12-31 16:00:...|   1|
> +++
> {code}
> It seems this should be 
> {code}
> ++---+
> |   a|1969-12-31 16:00:00.001|
> ++---+
> |1969-12-31 16:00:...|  1|
> ++---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19927.
---
Resolution: Duplicate

> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test1.sql:
> -
> USE ${hivevar:db_name};
> -
>  
> when executing the command: bin/spark-sql -f /tmp/test.sql  --hivevar 
> db_name=offline
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> So the parameter --hivevar cannot be read from the CLI.
> The bug still appears with the beeline command: bin/beeline  -f /tmp/test2.sql  
> --hivevar db_name=offline, with test2.sql:
> 
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> --



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-03-22 Thread Andrey Yakovenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936623#comment-15936623
 ] 

Andrey Yakovenko commented on SPARK-19984:
--

Unfortunately I cannot provide the code since company rules forbid it. I also 
cannot extract the exact part of the code where this happened.

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
>
> I had this error a few times on my local Hadoop 2.7.3 + Spark 2.1.0 
> environment. It is not a permanent error; the next time I run the job it may 
> disappear. Unfortunately I don't know how to reproduce the issue. As you can 
> see from the log, my logic is pretty complicated.
> Here is the part of the log I've got (container_1489514660953_0015_01_01)
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows1;
> /* 032 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime1;
> /* 033 */   private UnsafeRow agg_result1;
> /* 034 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder1;
> /* 035 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter1;
> /* 036 */
> /* 037 */   public GeneratedIterator(Object[] references) {
> /* 038 */ this.references = references;
> /* 039 */   }
> /* 040 */
> /* 041 */   public void init(int index, scala.collection.Iterator[] inputs) {
> /* 042 */ partitionIndex = index;
> /* 043 */ this.inputs = inputs;
> /* 044 */ wholestagecodegen_init_0();
> /* 045 */ wholestagecodegen_init_1();
> /* 046 */
> /* 047 */   }
> /* 048 */
> /* 049 */   private void wholestagecodegen_init_0() {
> /* 050 */ agg_initAgg = false;
> /* 051 */
> /* 052 */ agg_initAgg1 = false;
> /* 053 */
> /* 054 */ smj_leftInput = inputs[0];
> /* 055 */ smj_rightInput = inputs[1];
> /* 056 */
> /* 057 */ smj_rightRow = null;
> /* 058 */
> /* 059 */ smj_matches = new java.util.ArrayList();
> /* 060 */
> /* 061 */ this.smj_numOutputRows = 
> (org.apache.spark.sql.execution.metric.SQLMetric) references[0];
> /* 062 */ smj_result = new UnsafeRow(2);
> /* 063 */ this.smj_holder = new 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(smj_result, 
> 64);
> /* 064 */ this.smj_rowWriter = new 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(smj_holder, 
> 2);
> /* 065 */ 

[jira] [Commented] (SPARK-19927) SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1

2017-03-22 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936601#comment-15936601
 ] 

Yuming Wang commented on SPARK-19927:
-

Is this a duplicate of 
[SPARK-13983|https://issues.apache.org/jira/browse/SPARK-13983]?

> SparkThriftServer2 can not get ''--hivevar" variables in spark 2.1
> --
>
> Key: SPARK-19927
> URL: https://issues.apache.org/jira/browse/SPARK-19927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: CentOS 6.5,spark 2.1 build with mvn -Pyarn -Phadoop-2.6 
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
>Reporter: bruce xu
>
> suppose the content of file test1.sql:
> -
> USE ${hivevar:db_name};
> -
>  
> when executing the command: bin/spark-sql -f /tmp/test.sql  --hivevar 
> db_name=offline
> the output is: 
> 
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> no viable alternative at input ''(line 1, pos 4)
> == SQL ==
> use 
> ^^^ (state=,code=0)
> -
> So the parameter --hivevar cannot be read from the CLI.
> The bug still appears with the beeline command: bin/beeline  -f /tmp/test2.sql  
> --hivevar db_name=offline, with test2.sql:
> 
> !connect jdbc:hive2://localhost:1 test test
> USE ${hivevar:db_name};
> --



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19837) Fetch failure throws a SparkException in SparkHiveWriter

2017-03-22 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936597#comment-15936597
 ] 

Imran Rashid commented on SPARK-19837:
--

I think this is addressed by SPARK-19276, which handles the main problem here.  
We should clean up the exception handling to avoid encapsulating fetch 
failures, though.
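
A rough sketch of that cleanup idea (illustrative only; writeRows is a 
hypothetical stand-in, not the actual SparkHiveWriter code): rethrow fetch 
failures as-is instead of wrapping them in a generic SparkException, so the 
scheduler can still recognize them.

{code}
import org.apache.spark.SparkException
import org.apache.spark.shuffle.FetchFailedException

def runWriteTask(writeRows: () => Unit): Unit = {
  try {
    writeRows()
  } catch {
    // Do not encapsulate fetch failures; the scheduler handles them specially.
    case fe: FetchFailedException => throw fe
    case t: Throwable => throw new SparkException("Task failed while writing rows.", t)
  }
}
{code}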

> Fetch failure throws a SparkException in SparkHiveWriter
> 
>
> Key: SPARK-19837
> URL: https://issues.apache.org/jira/browse/SPARK-19837
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently Fetchfailure in SparkHiveWriter fails the job with following 
> exception
> {code}
> 0_0): org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:385)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Connection reset by 
> peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
> at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
> at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:343)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Commented] (SPARK-20019) spark can not load alluxio fileSystem after adding jar

2017-03-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936548#comment-15936548
 ] 

Sean Owen commented on SPARK-20019:
---

I don't know, because I'm not sure this is supposed to work the way you are 
using it. --jars seems like the right-er thing to do.
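
For context, a sketch of the --jars route (paths are the placeholders from the 
report; this is a suggestion, not a verified fix):

{code}
// Ship the Alluxio client at submit time instead of ADD JAR at runtime, e.g.
//   spark-sql --jars /xxx/xxx/alluxioxxx.jar
// then register the filesystem implementation on the Hadoop configuration:
spark.sparkContext.hadoopConfiguration.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem")
spark.sql("select * from alluxionTbl").show()
{code}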

> spark can not load alluxio fileSystem after adding jar
> --
>
> Key: SPARK-20019
> URL: https://issues.apache.org/jira/browse/SPARK-20019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: roncenzhao
> Attachments: exception_stack.png
>
>
> The following SQL cannot load the Alluxio filesystem and throws a 
> `ClassNotFoundException`.
> ```
> add jar /xxx/xxx/alluxioxxx.jar;
> set fs.alluxio.impl=alluxio.hadoop.FileSystem;
> select * from alluxionTbl;
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20019) spark can not load alluxio fileSystem after adding jar

2017-03-22 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936541#comment-15936541
 ] 

roncenzhao commented on SPARK-20019:


[~srowen] Should I create a PR for this problem?

> spark can not load alluxio fileSystem after adding jar
> --
>
> Key: SPARK-20019
> URL: https://issues.apache.org/jira/browse/SPARK-20019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: roncenzhao
> Attachments: exception_stack.png
>
>
> The following SQL cannot load the Alluxio filesystem and throws a 
> `ClassNotFoundException`.
> ```
> add jar /xxx/xxx/alluxioxxx.jar;
> set fs.alluxio.impl=alluxio.hadoop.FileSystem;
> select * from alluxionTbl;
> ```



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936516#comment-15936516
 ] 

Xiao Li commented on SPARK-20008:
-

Sure, will do. 

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 against expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case; I'm 
> not sure.
> Workaround: for now I check that the counts are != 0 before this operation, 
> which is not good for performance. Hence I'm creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kinds of operations are implemented as a left 
> anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug, but my guess is that the logic of comparing 
> NULL = NULL (should it return true or false?) is causing this kind of 
> confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1

2017-03-22 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-20008:
---

Assignee: Xiao Li

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 
> 1
> ---
>
> Key: SPARK-20008
> URL: https://issues.apache.org/jira/browse/SPARK-20008
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
>Reporter: Ravindra Bajpai
>Assignee: Xiao Li
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 
> 1 against expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage 
> point of view, and hence I consider it a bug. It may be a boundary case; I'm 
> not sure.
> Workaround: for now I check that the counts are != 0 before this operation, 
> which is not good for performance. Hence I'm creating a JIRA to track it.
> As Young Zhang explained in reply to my mail - 
> Starting from Spark 2, these kinds of operations are implemented as a left 
> anti join, instead of using RDD operations directly.
> Same issue also on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[], output=[])
>   +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>  :- Scan ExistingRDD[]
>  +- BroadcastExchange IdentityBroadcastMode
> +- Scan ExistingRDD[]
> This arguably indicates a bug, but my guess is that the logic of comparing 
> NULL = NULL (should it return true or false?) is causing this kind of 
> confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2017-03-22 Thread Oliver Koeth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936501#comment-15936501
 ] 

Oliver Koeth commented on SPARK-15487:
--

It seems the follow-up issue was never opened. I created [SPARK-20044] to 
address the problems with running behind a site proxy such as 
www.mydomain.com/spark.

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently, when running in standalone mode, the Spark UI's links to workers 
> and application drivers point to internal/protected network endpoints. So to 
> access the worker/application UIs, the user's machine has to connect to a VPN 
> or have direct access to the internal network.
> Therefore the proposal is to make the Spark master UI reverse proxy this 
> information back to the user, so that only the Spark master UI needs to be 
> opened up to the internet. 
> The minimal changes can be done by adding another route e.g. 
> http://spark-master.com/target// so when request goes to target, 
> ProxyServlet kicks in and takes the  and forwards the request to it 
> and send response back to user.
> More information about discussions for this features can be found on this 
> mailing list thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20061) Reading a file with colon (:) from S3 fails with URISyntaxException

2017-03-22 Thread Michel Lemay (JIRA)
Michel Lemay created SPARK-20061:


 Summary: Reading a file with colon (:) from S3 fails with 
URISyntaxException
 Key: SPARK-20061
 URL: https://issues.apache.org/jira/browse/SPARK-20061
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0
 Environment: EC2, AWS
Reporter: Michel Lemay


When reading a bunch of files from s3 using wildcards, it fails with the 
following exception:

{code}
scala> val fn = "s3a://mybucket/path/*/"
scala> val ds = spark.readStream.schema(schema).json(fn)

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: 
2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:93)
  at org.apache.hadoop.fs.Globber.glob(Globber.java:241)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1657)
  at org.apache.spark.deploy.SparkHadoopUtil.globPath(SparkHadoopUtil.scala:237)
  at 
org.apache.spark.deploy.SparkHadoopUtil.globPathIfNecessary(SparkHadoopUtil.scala:243)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:131)
  at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$2.apply(DataSource.scala:127)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at 
org.apache.spark.sql.execution.datasources.DataSource.tempFileIndex$lzycompute$1(DataSource.scala:127)
  at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$tempFileIndex$1(DataSource.scala:124)
  at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:138)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
  at 
org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
  at 
org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
  at 
org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:133)
  at 
org.apache.spark.sql.streaming.DataStreamReader.json(DataStreamReader.scala:181)
  ... 50 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: 
2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 73 more
{code}

The file in question sits at the root of s3a://mybucket/path/

{code}
aws s3 ls s3://mybucket/path/

   PRE subfolder1/
   PRE subfolder2/
...
2017-01-06 20:33:46   1383 
2017-01-06T20:33:45.255-analyticsqa-49569270507599054034141623773442922465540524816321216514.json
...
{code}


Removing the wildcard from the path makes it work, but it obviously misses all 
files in subdirectories.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20044) Support Spark UI behind front-end reverse proxy using a path prefix

2017-03-22 Thread Oliver Koeth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936482#comment-15936482
 ] 

Oliver Koeth commented on SPARK-20044:
--

I tried a few (actually 5) experimental changes, see 
https://github.com/okoethibm/spark/commit/cf889c75be0db938c91695046aa297558217c2c3
With just this, I got the spark UI to run behind nginx + a path prefix, and all 
the UI links that I tried (master, worker and running app) worked fine.
I probably still missed some places that need adjusting, but it does not seem 
like the improvement requires lots of modifications all over the place.
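
For reference, a sketch of the Spark side of such a setup (both properties are 
real Spark 2.1 settings; the nginx location block is only indicated in the 
comment):

{code}
// Front end, e.g. nginx:  location /spark/ { proxy_pass http://spark-master:8080/; }
val conf = new org.apache.spark.SparkConf()
  // route worker/application UIs through the master UI (SPARK-15487)
  .set("spark.ui.reverseProxy", "true")
  // external URL including the context root this issue wants honored everywhere
  .set("spark.ui.reverseProxyUrl", "https://www.mydomain.com/spark")
{code}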

> Support Spark UI behind front-end reverse proxy using a path prefix
> ---
>
> Key: SPARK-20044
> URL: https://issues.apache.org/jira/browse/SPARK-20044
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: Oliver Koeth
>Priority: Minor
>  Labels: reverse-proxy, sso
>
> Purpose: allow to run the Spark web UI behind a reverse proxy with URLs 
> prefixed by a context root, like www.mydomain.com/spark. In particular, this 
> allows to access multiple Spark clusters through the same virtual host, only 
> distinguishing them by context root, like www.mydomain.com/cluster1, 
> www.mydomain.com/cluster2, and it allows to run the Spark UI in a common 
> cookie domain (for SSO) with other services.
> [SPARK-15487] introduced some support for front-end reverse proxies by 
> allowing all Spark UI requests to be routed through the master UI as a single 
> endpoint and also added a spark.ui.reverseProxyUrl setting to define a 
> another proxy sitting in front of Spark. However, as noted in the comments on 
> [SPARK-15487], this mechanism does not currently work if the reverseProxyUrl 
> includes a context root like the examples above: Most links generated by the 
> Spark UI result in full path URLs (like /proxy/app-"id"/...) that do not 
> account for a path prefix (context root) and work only if the Spark UI "owns" 
> the entire virtual host. In fact, the only place in the UI where the 
> reverseProxyUrl seems to be used is the back-link from the worker UI to the 
> master UI.
> The discussion on [SPARK-15487] proposes to open a new issue for the problem, 
> but that does not seem to have happened, so this issue aims to address the 
> remaining shortcomings of spark.ui.reverseProxyUrl
> The problem can be partially worked around by doing content rewrite in a 
> front-end proxy and prefixing src="/..." or href="/..." links with a context 
> root. However, detecting and patching URLs in HTML output is not a robust 
> approach and breaks down for URLs included in custom REST responses. E.g. the 
> "allexecutors" REST call used from the Spark 2.1.0 application/executors page 
> returns links for log viewing that direct to the worker UI and do not work in 
> this scenario.
> This issue proposes to honor spark.ui.reverseProxyUrl throughout Spark UI URL 
> generation. Experiments indicate that most of this can simply be achieved by 
> using/prepending spark.ui.reverseProxyUrl to the existing spark.ui.proxyBase 
> system property. Beyond that, the places that require adaption are
> - worker and application links in the master web UI
> - webui URLs returned by REST interfaces
> Note: It seems that returned redirect location headers do not need to be 
> adapted, since URL rewriting for these is commonly done in front-end proxies 
> and has a well-defined interface



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-20049:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Possibly a doc issue, but I think you've otherwise analyzed it correctly. Open 
a pull request with suggestions.

> Writing data to Parquet with partitions takes very long after the job finishes
> --
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
> GNU/Linux 8.7 (jessie)
>Reporter: Jakub Nowacki
>Priority: Minor
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is 
> quite straightforward and the data set is really a sample from a larger data 
> set in Parquet; the job is done in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job has been marked in 
> PySpark and UI as finished, the Python interpreter still was showing it as 
> busy. Indeed, when I checked the HDFS folder I noticed that the files are 
> still transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} 
> folders. 
> First of all it takes much longer than saving the same set without 
> partitioning. Second, it is done in the background, without visible progress 
> of any kind. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

2017-03-22 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936421#comment-15936421
 ] 

Jakub Nowacki commented on SPARK-20049:
---

I did a bit more digging, and it turned out that both the writing and the 
reading performance were low, most likely due to the number of files per 
partition. Namely, every folder contained a number of files corresponding to 
the number of partitions of the saved DataFrame, which was just over 3000 in my 
case. Repartitioning like:
{code}
# there is column 'date' in df
df.repartition("date").write.partitionBy("date").parquet("dest_dir")
{code}
fixes the issue, though it creates one file per partition, which is a bit too 
much in my case. This can be addressed, e.g.:
{code}
# there is column 'date' in df; hour() comes from pyspark.sql.functions
from pyspark.sql.functions import hour
df.repartition("date", hour("createdAt")).write.partitionBy("date").parquet("dest_dir")
{code}
which works similarly, but the files in the partition folders are smaller.

So IMO there are 4 issues to address:
# for some reason the writing of files on HDFS takes a long time, which is not 
indicated anywhere and takes much longer than a normal write (in my case 5 
minutes vs 1.5 hours)
# some form of additional progress indicator should be included somewhere in 
the UI, logs and/or shell output
# the suggestion to repartition before using {{partitionBy}} should be 
highlighted in the documentation
# maybe automatic repartitioning before saving should be considered, though 
this can be controversial

> Writing data to Parquet with partitions takes very long after the job finishes
> --
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark, SQL
>Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian 
> GNU/Linux 8.7 (jessie)
>Reporter: Jakub Nowacki
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is 
> quite straightforward and the data set is really a sample from a larger data 
> set in Parquet; the job is done in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job has been marked in 
> PySpark and UI as finished, the Python interpreter still was showing it as 
> busy. Indeed, when I checked the HDFS folder I noticed that the files are 
> still transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} 
> folders. 
> First of all it takes much longer than saving the same set without 
> partitioning. Second, it is done in the background, without visible progress 
> of any kind. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19999) Test failures in Spark Core due to java.nio.Bits.unaligned()

2017-03-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936349#comment-15936349
 ] 

Sean Owen commented on SPARK-1:
---

Although this is going to be a very niche problem, and eventually fixed in the 
JDK, I imagine it's also pretty easy to patch around -- is there a drawback to 
special-casing this arch in the Spark code or will it have another consequence?
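
A hedged sketch of what such special-casing could look like (illustrative only, 
not the actual Spark patch; the arch list is an assumption based on this 
report):

{code}
// Treat ppc64le/ppc64 as supporting unaligned access even though
// java.nio.Bits.unaligned() reports false there.
def unalignedSupported(): Boolean = {
  val arch = System.getProperty("os.arch", "")
  if (arch.matches("^(ppc64le|ppc64)$")) {
    true
  } else {
    try {
      val m = Class.forName("java.nio.Bits").getDeclaredMethod("unaligned")
      m.setAccessible(true)
      m.invoke(null).asInstanceOf[Boolean]
    } catch {
      case _: Throwable => false  // be conservative if reflection fails
    }
  }
}
{code}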

> Test failures in Spark Core due to java.nio.Bits.unaligned()
> 
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
> Environment: Ubuntu 14.04 ppc64le 
> $ java -version
> openjdk version "1.8.0_111"
> OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
> OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
>Reporter: Sonia Garudi
>  Labels: ppc64le
>
> There are multiple test failures seen in Spark Core project with the 
> following error message :
> {code:borderStyle=solid}
> java.lang.IllegalArgumentException: requirement failed: No support for 
> unaligned Unsafe. Set spark.memory.offHeap.enabled to false.
> {code}
> These errors occur due to java.nio.Bits.unaligned(), which does not return 
> true for the ppc64le arch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20027) Compilation fixed in java docs.

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20027.
---
   Resolution: Fixed
 Assignee: Prashant Sharma
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/17358

> Compilation fixed in java docs.
> ---
>
> Key: SPARK-20027
> URL: https://issues.apache.org/jira/browse/SPARK-20027
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.2.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>Priority: Trivial
> Fix For: 2.2.0
>
>
> During build/sbt publish-local, the build breaks due to javadoc errors. This 
> patch fixes those errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14265) When stage is reRubmitted, DAG visualization does not render correctly for this stage

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14265.
---
Resolution: Not A Problem

> When stage is reRubmitted,  DAG visualization does not render correctly for 
> this stage
> --
>
> Key: SPARK-14265
> URL: https://issues.apache.org/jira/browse/SPARK-14265
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1
>Reporter: KaiXinXIaoLei
> Attachments: dagIsBlank.png
>
>
> I ran queries using "bin/spark-sql --master yarn". A stage failed and was 
> resubmitted. Then I checked the DAG visualization in the web UI; it was blank.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20059) HbaseCredentialProvider uses wrong classloader

2017-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936329#comment-15936329
 ] 

Apache Spark commented on SPARK-20059:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/17388

> HbaseCredentialProvider uses wrong classloader
> --
>
> Key: SPARK-20059
> URL: https://issues.apache.org/jira/browse/SPARK-20059
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>
> {{HBaseCredentialProvider}} uses the system classloader instead of the child 
> classloader, which makes HBase jars specified with {{--jars}} fail to work, so 
> we should use the right classloader here.
> Besides, in YARN client mode the jars specified with {{--jars}} are not added 
> to the client's classpath, which makes it fail to load HBase jars and issue 
> tokens in our scenario. Also, some customized credential providers cannot be 
> registered in the client.
> So here I will fix these two issues.
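
A minimal sketch of the classloader change being described (illustrative; the 
HBase class name is just an example of something only visible through --jars):

{code}
// Resolve HBase classes through the context (child) classloader, falling back
// to this class's loader, instead of the system classloader.
val loader = Option(Thread.currentThread().getContextClassLoader)
  .getOrElse(getClass.getClassLoader)
val hbaseConfClass =
  Class.forName("org.apache.hadoop.hbase.HBaseConfiguration", true, loader)
{code}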



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20059) HbaseCredentialProvider uses wrong classloader

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20059:


Assignee: Apache Spark

> HbaseCredentialProvider uses wrong classloader
> --
>
> Key: SPARK-20059
> URL: https://issues.apache.org/jira/browse/SPARK-20059
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> {{HBaseCredentialProvider}} uses the system classloader instead of the child 
> classloader, which makes HBase jars specified with {{--jars}} fail to work, so 
> we should use the right classloader here.
> Besides, in YARN client mode the jars specified with {{--jars}} are not added 
> to the client's classpath, which makes it fail to load HBase jars and issue 
> tokens in our scenario. Also, some customized credential providers cannot be 
> registered in the client.
> So here I will fix these two issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20059) HbaseCredentialProvider uses wrong classloader

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20059:


Assignee: (was: Apache Spark)

> HbaseCredentialProvider uses wrong classloader
> --
>
> Key: SPARK-20059
> URL: https://issues.apache.org/jira/browse/SPARK-20059
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Saisai Shao
>
> {{HBaseCredentialProvider}} uses the system classloader instead of the child 
> classloader, which makes HBase jars specified with {{--jars}} fail to work, 
> so we should use the right classloader here.
> Besides, in yarn client mode the jars specified with {{--jars}} are not added 
> to the client's classpath, which makes it fail to load HBase jars and issue 
> tokens in our scenario. Also, some customized credential providers cannot be 
> registered in the client.
> So here I will fix these two issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan

2017-03-22 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936303#comment-15936303
 ] 

Nattavut Sutyanyong commented on SPARK-19712:
-

Another scenario of a missed opportunity to convert a left outer join to an 
inner join:

Query #1 - Exists subquery correlated to the right table of a left outer join
{code}
sql("select * from (select t1a, t2b from t1 left join t2 on t1a = t2a) tx where 
exists (select 1 from t3 where tx.t2b = t3.t3b)").explain(true)

== Optimized Logical Plan ==
Project [t1a#286, t2b#290]
+- Join LeftSemi, (t2b#290 = t3b#293)
   :- Join LeftOuter, (t1a#286 = t2a#289)
   :  :- Project [t1a#286]
   :  :  +- Relation[t1a#286,t1b#287,t1c#288] parquet
   :  +- Project [t2a#289, t2b#290]
   : +- Relation[t2a#289,t2b#290,t2c#291] parquet
   +- Project [1 AS 1#298, t3b#293]
  +- Relation[t3a#292,t3b#293,t3c#294] parquet
{code}

Query #2 - A semantically equivalent query using left semi join
{code}
sql("select * from (select t1a, t2b from t1 left join t2 on t1a = t2a) tx left 
semi join t3 on tx.t2b = t3.t3b").explain(true)

== Optimized Logical Plan ==
Join LeftSemi, (t2b#248 = t3b#251)
:- Project [t1a#244, t2b#248]
:  +- Join Inner, (t1a#244 = t2a#247)
: :- Project [t1a#244]
: :  +- Filter isnotnull(t1a#244)
: : +- Relation[t1a#244,t1b#245,t1c#246] parquet
: +- Project [t2a#247, t2b#248]
:+- Filter (isnotnull(t2b#248) && isnotnull(t2a#247))
:   +- Relation[t2a#247,t2b#248,t2c#249] parquet
+- Project [t3b#251]
   +- Relation[t3a#250,t3b#251,t3c#252] parquet

{code}

In Query #2, the left outer join is rewritten to an inner join, which allows 
more join strategies and could trigger other optimizations.
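
A minimal spark-shell sketch along these lines should reproduce the comparison (the data below is made up; only the table shapes matter):
{code}
// create small parquet-backed tables matching the naming convention in this ticket
import spark.implicits._

Seq((1, 1, 1)).toDF("t1a", "t1b", "t1c").write.saveAsTable("t1")
Seq((1, 2, 2)).toDF("t2a", "t2b", "t2c").write.saveAsTable("t2")
Seq((2, 2, 2)).toDF("t3a", "t3b", "t3c").write.saveAsTable("t3")

// Query #1: EXISTS correlated to the right table of the left outer join
sql("""select * from (select t1a, t2b from t1 left join t2 on t1a = t2a) tx
       where exists (select 1 from t3 where tx.t2b = t3.t3b)""").explain(true)

// Query #2: the semantically equivalent LEFT SEMI JOIN form
sql("""select * from (select t1a, t2b from t1 left join t2 on t1a = t2a) tx
       left semi join t3 on tx.t2b = t3.t3b""").explain(true)
{code}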

> EXISTS and Left Semi join do not produce the same plan
> --
>
> Key: SPARK-19712
> URL: https://issues.apache.org/jira/browse/SPARK-19712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nattavut Sutyanyong
>
> This problem was found during the development of SPARK-18874.
> The EXISTS form in the following query:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 
> from t3 where t1.t1b=t3.t3b)")}}
> gives the optimized plan below:
> {code}
> == Optimized Logical Plan ==
> Join Inner, (t1a#7 = t2a#25)
> :- Join LeftSemi, (t1b#8 = t3b#58)
> :  :- Filter isnotnull(t1a#7)
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Project [1 AS 1#271, t3b#58]
> : +- Relation[t3a#57,t3b#58,t3c#59] parquet
> +- Filter isnotnull(t2a#25)
>+- Relation[t2a#25,t2b#26,t2c#27] parquet
> {code}
> whereas a semantically equivalent Left Semi join query below:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on 
> t1.t1b=t3.t3b")}}
> gives the following optimized plan:
> {code}
> == Optimized Logical Plan ==
> Join LeftSemi, (t1b#8 = t3b#58)
> :- Join Inner, (t1a#7 = t2a#25)
> :  :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7))
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Filter isnotnull(t2a#25)
> : +- Relation[t2a#25,t2b#26,t2c#27] parquet
> +- Project [t3b#58]
>+- Relation[t3a#57,t3b#58,t3c#59] parquet
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20060) Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive MetaStore

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20060:


Assignee: (was: Apache Spark)

> Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive 
> MetaStore 
> --
>
> Key: SPARK-20060
> URL: https://issues.apache.org/jira/browse/SPARK-20060
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>
> For **Spark on non-YARN** mode on a kerberized HDFS, we don't obtain 
> credentials from the Hive metastore, HDFS, etc., and just use the local 
> kinit'ed user to connect to them. But if we specify the --proxy-user argument 
> in a non-YARN mode such as local or standalone, where we simply use 
> `UGI.createProxyUser` to get a proxy UGI as the effective user and wrap the 
> code in doAs, the proxy UGI fails to talk to the Hive metastore because it 
> has no credentials. Thus, we need to obtain credentials via the real user and 
> add them to the proxy UGI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20060) Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive MetaStore

2017-03-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936293#comment-15936293
 ] 

Apache Spark commented on SPARK-20060:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/17387

> Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive 
> MetaStore 
> --
>
> Key: SPARK-20060
> URL: https://issues.apache.org/jira/browse/SPARK-20060
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>
> For **Spark on non-YARN** mode on a kerberized HDFS, we don't obtain 
> credentials from the Hive metastore, HDFS, etc., and just use the local 
> kinit'ed user to connect to them. But if we specify the --proxy-user argument 
> in a non-YARN mode such as local or standalone, where we simply use 
> `UGI.createProxyUser` to get a proxy UGI as the effective user and wrap the 
> code in doAs, the proxy UGI fails to talk to the Hive metastore because it 
> has no credentials. Thus, we need to obtain credentials via the real user and 
> add them to the proxy UGI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20060) Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive MetaStore

2017-03-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20060:


Assignee: Apache Spark

> Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive 
> MetaStore 
> --
>
> Key: SPARK-20060
> URL: https://issues.apache.org/jira/browse/SPARK-20060
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Submit
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>
> For **Spark on non-YARN** mode on a kerberized HDFS, we don't obtain 
> credentials from the Hive metastore, HDFS, etc., and just use the local 
> kinit'ed user to connect to them. But if we specify the --proxy-user argument 
> in a non-YARN mode such as local or standalone, where we simply use 
> `UGI.createProxyUser` to get a proxy UGI as the effective user and wrap the 
> code in doAs, the proxy UGI fails to talk to the Hive metastore because it 
> has no credentials. Thus, we need to obtain credentials via the real user and 
> add them to the proxy UGI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20060) Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser Fails Talking to Hive MetaStore

2017-03-22 Thread Kent Yao (JIRA)
Kent Yao created SPARK-20060:


 Summary: Spark On Non-Yarn Mode with Kerberized HDFS ProxyUser 
Fails Talking to Hive MetaStore 
 Key: SPARK-20060
 URL: https://issues.apache.org/jira/browse/SPARK-20060
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Submit
Affects Versions: 2.2.0
Reporter: Kent Yao


For **Spark on non-YARN** mode on a kerberized HDFS, we don't obtain credentials 
from the Hive metastore, HDFS, etc., and just use the local kinit'ed user to 
connect to them. But if we specify the --proxy-user argument in a non-YARN mode 
such as local or standalone, where we simply use `UGI.createProxyUser` to get a 
proxy UGI as the effective user and wrap the code in doAs, the proxy UGI fails 
to talk to the Hive metastore because it has no credentials. Thus, we need to 
obtain credentials via the real user and add them to the proxy UGI.
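
A rough sketch of that flow (hypothetical, not the actual patch; "proxyUserName" and the HDFS-only token fetch are placeholders):
{code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

val hadoopConf = new Configuration()
val realUser = UserGroupInformation.getCurrentUser   // the kinit'ed user
val proxyUgi = UserGroupInformation.createProxyUser("proxyUserName", realUser)

// The real user obtains delegation tokens (HDFS here; Hive metastore tokens
// would be fetched analogously through the Hive client APIs).
val creds = new Credentials()
realUser.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    FileSystem.get(hadoopConf).addDelegationTokens(realUser.getUserName, creds)
  }
})

// Hand the tokens to the proxy UGI before running the application under doAs.
proxyUgi.addCredentials(creds)
proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // run the Spark application code as the proxy user here
  }
})
{code}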



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20058) the running application status changed from running to waiting when a master goes down and it fails over to another standby master

2017-03-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936255#comment-15936255
 ] 

Saisai Shao commented on SPARK-20058:
-

Please subscribe to the Spark user mailing list and send the question to that 
list.

> the running application status changed from running to waiting when a master 
> goes down and it fails over to another standby master
> ---
>
> Key: SPARK-20058
> URL: https://issues.apache.org/jira/browse/SPARK-20058
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: wangweidong
>Priority: Minor
>
> 1. I deployed Spark in cluster mode: test186 is the master, test171 is a 
> backup master, and the workers are test137, test155 and test138.
> 2. Start Spark with the command sbin/start-all.sh.
> 3. Submit my task with the command bin/spark-submit --supervise --deploy-mode 
> cluster --master spark://test186:7077 etc.
> 4. View the web UI at test186:8080; I can see my application is running 
> normally.
> 5. Stop the master on test186. After a period of time, view the web UI on 
> test171 (the standby master); I see my application is waiting and never 
> changes back to running, but if I click one application and open its detail 
> page, I can see it is actually running.
> Is this a bug, or did I start Spark with an incorrect setting?
> Help!!!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20058) the running application status changed from running to waiting when a master goes down and it fails over to another standby master

2017-03-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936255#comment-15936255
 ] 

Saisai Shao edited comment on SPARK-20058 at 3/22/17 1:03 PM:
--

Please subscribe to the Spark user mailing list and send the question to that 
list.


was (Author: jerryshao):
Please subscribe this spark user mail list and set the question to this mail 
list.

> the running application status changed from running to waiting when a master 
> goes down and it fails over to another standby master
> ---
>
> Key: SPARK-20058
> URL: https://issues.apache.org/jira/browse/SPARK-20058
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: wangweidong
>Priority: Minor
>
> 1. I deployed Spark in cluster mode: test186 is the master, test171 is a 
> backup master, and the workers are test137, test155 and test138.
> 2. Start Spark with the command sbin/start-all.sh.
> 3. Submit my task with the command bin/spark-submit --supervise --deploy-mode 
> cluster --master spark://test186:7077 etc.
> 4. View the web UI at test186:8080; I can see my application is running 
> normally.
> 5. Stop the master on test186. After a period of time, view the web UI on 
> test171 (the standby master); I see my application is waiting and never 
> changes back to running, but if I click one application and open its detail 
> page, I can see it is actually running.
> Is this a bug, or did I start Spark with an incorrect setting?
> Help!!!



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20059) HbaseCredentialProvider uses wrong classloader

2017-03-22 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-20059:
---

 Summary: HbaseCredentialProvider uses wrong classloader
 Key: SPARK-20059
 URL: https://issues.apache.org/jira/browse/SPARK-20059
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.1.0, 2.2.0
Reporter: Saisai Shao


{{HBaseCredentialProvider}} uses the system classloader instead of the child 
classloader, which makes HBase jars specified with {{--jars}} fail to work, so we 
should use the right classloader here.

Besides, in yarn client mode the jars specified with {{--jars}} are not added to 
the client's classpath, which makes it fail to load HBase jars and issue tokens 
in our scenario. Also, some customized credential providers cannot be registered 
in the client.

So here I will fix these two issues.
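
Not the actual patch, but a minimal sketch of the classloader point, assuming {{org.apache.hadoop.hbase.HBaseConfiguration}} is only available through the jars passed with {{--jars}}:
{code}
import scala.util.Try

// Hypothetical class that only exists in the user-supplied --jars (assumption).
val providerClass = "org.apache.hadoop.hbase.HBaseConfiguration"

// Resolves against the caller's defining classloader (effectively the
// launcher / system classpath inside Spark core), so it misses --jars:
val viaSystem = Try(Class.forName(providerClass))

// Resolves against the thread context classloader, which Spark points at the
// child classloader that also contains the --jars entries:
val viaContext = Try(Class.forName(providerClass, true,
  Thread.currentThread().getContextClassLoader))

println(s"system loader found it: ${viaSystem.isSuccess}, context loader found it: ${viaContext.isSuccess}")
{code}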





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2017-03-22 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222
 ] 

Jorge Machado edited comment on SPARK-5236 at 3/22/17 12:52 PM:


[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. I'm on 1.6.3.

I traced it down to GeneratePredicate.scala, (r: InternalRow) => p.eval(r); in 
my example it tries an instanceOf[MutableLong] cast on a String, which fails. I 
have a filter on the DataFrame and a groupBy count.

{noformat}

/**
  * @param schema how each returned row has to look; the value returned from next() must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
                            val schema: StructType,
                            val repositoryHistory: DeviceHistoryRepository,
                            val timeZoneId: String,
                            val tablePartitionInfo: TablePartitionInfo,
                            val from: Long,
                            val to: Long) extends Iterator[InternalRow] {

  private val internalItr: ClosableIterator[TagValue[Double]] =
    repositoryHistory.scanTagValues(from, to, tablePartitionInfo)

  override def hasNext: Boolean = internalItr.hasNext

  override def next(): InternalRow = {
    val tagValue = internalItr.next()
    val instant = ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp),
      ZoneId.of(timeZoneId)).toInstant
    val timestamp = Timestamp.from(instant)

    InternalRow.fromSeq(Array(tagValue.getTimestamp, tagValue.getGuid, tagValue.getGuid, tagValue.getValue))

    // Build a mutable row typed from the schema and fill it field by field.
    val mutableRow = new SpecificMutableRow(schema.fields.map(f => f.dataType))
    for (i <- schema.fields.indices) {
      updateMutableRow(i, tagValue, mutableRow, schema(i))
    }
    mutableRow
  }

  def updateMutableRow(i: Int, tagValue: TagValue[Double], row: SpecificMutableRow, field: StructField): Unit = {
    // #TODO this is ugly.
    field.name match {
      case "Date"     => row.setLong(i, tagValue.getTimestamp.toLong)
      case "Device"   => row.update(i, UTF8String.fromString(tagValue.getGuid))
      case "Tag"      => row.update(i, UTF8String.fromString(tagValue.getTagName))
      case "TagValue" => row.setDouble(i, tagValue.getValue)
    }
  }

  override def toString(): String = {
    "Iterator for Region Name " + tablePartitionInfo.getRegionName + " Range:" + from + "until" + "to"
  }
}
{noformat}

Then I get : 

{noformat}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

{noformat}


was (Author: jomach):
[~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a 
datasource for Hbase with some custom schema.  I'm on 1.6.3

I traced down to GeneratePredicates.scala (r: InternalRow) => p.eval(r)

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,

[jira] [Comment Edited] (SPARK-19992) spark-submit on deployment-mode cluster

2017-03-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936241#comment-15936241
 ] 

Saisai Shao edited comment on SPARK-19992 at 3/22/17 12:51 PM:
---

Oh, I see. Checking the code again, it looks like "/*" does not work with the 
"local" scheme; only Hadoop-supported schemes such as hdfs and file support 
glob paths.

I agree this is probably just a setup/env problem.


was (Author: jerryshao):
Oh, I see. Check the code again, looks like "/*" cannot be worked with "local" 
schema, only hadoop support schema like hdfs, file could support glob path.

> spark-submit on deployment-mode cluster
> ---
>
> Key: SPARK-19992
> URL: https://issues.apache.org/jira/browse/SPARK-19992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2
> Environment: spark version 2.0.2
> hadoop version 2.6.0
>Reporter: narendra maru
>
> spark version 2.0.2
> hadoop version 2.6.0
> spark-submit command:
> "spark-submit --class spark.mongohadoop.testing3 --master yarn --deploy-mode 
> cluster --jars /home/ec2-user/jars/hgmongonew.jar, 
> /home/ec2-user/jars/mongo-hadoop-spark-2.0.1.jar"
> after adding the following in:
> 1. spark-defaults.conf
> spark.executor.extraJavaOptions -Dconfig.fuction.conf 
> spark.yarn.jars=local:/usr/local/spark-2.0.2-bin-hadoop2.6/yarn/*
> spark.eventLog.dir=hdfs://localhost:9000/user/spark/applicationHistory
> spark.eventLog.enabled=true
> 2. yarn-site.xml
> <property>
>   <name>yarn.application.classpath</name>
>   <value>
> /usr/local/hadoop-2.6.0/etc/hadoop,
> /usr/local/hadoop-2.6.0/,
> /usr/local/hadoop-2.6.0/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/tools/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/lib/*,
> /usr/local/spark-2.0.2-bin-hadoop2.6/jars/spark-yarn_2.11-2.0.2.jar
>   </value>
> </property>
> Error in the log:
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ApplicationMaster
> Error on the terminal:
> diagnostics: Application application_1489673977198_0002 failed 2 times due to 
> AM Container for appattempt_1489673977198_0002_02 exited with exitCode: 1 
> For more detailed output, check the application tracking page: 
> http://bdg-hdp-sparkmaster:8088/proxy/application_1489673977198_0002/ Then 
> click on links to logs of each attempt.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19992) spark-submit on deployment-mode cluster

2017-03-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936241#comment-15936241
 ] 

Saisai Shao commented on SPARK-19992:
-

Oh, I see. Checking the code again, it looks like "/*" does not work with the 
"local" scheme; only Hadoop-supported schemes such as hdfs and file support 
glob paths.
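
A small sketch of that distinction (the paths are placeholders): glob expansion goes through the Hadoop FileSystem API, which exists for schemes like hdfs:// and file:// but not for local:, which is only a marker meaning "already present on every node".
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

val conf = new Configuration()

// A file:// (or hdfs://) glob can be expanded through the FileSystem API:
val glob = new Path("file:/usr/local/spark-2.0.2-bin-hadoop2.6/jars/*")
val fs = glob.getFileSystem(conf)
val matches: Array[FileStatus] = Option(fs.globStatus(glob)).getOrElse(Array.empty[FileStatus])
matches.map(_.getPath).foreach(println)

// A local: URI has no FileSystem implementation behind it, so the same expansion
// is not available (assumption: no filesystem is registered for that scheme):
// new Path("local:/usr/local/spark-2.0.2-bin-hadoop2.6/yarn/*").getFileSystem(conf)
{code}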

> spark-submit on deployment-mode cluster
> ---
>
> Key: SPARK-19992
> URL: https://issues.apache.org/jira/browse/SPARK-19992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2
> Environment: spark version 2.0.2
> hadoop version 2.6.0
>Reporter: narendra maru
>
> spark version 2.0.2
> hadoop version 2.6.0
> spark-submit command:
> "spark-submit --class spark.mongohadoop.testing3 --master yarn --deploy-mode 
> cluster --jars /home/ec2-user/jars/hgmongonew.jar, 
> /home/ec2-user/jars/mongo-hadoop-spark-2.0.1.jar"
> after adding the following in:
> 1. spark-defaults.conf
> spark.executor.extraJavaOptions -Dconfig.fuction.conf 
> spark.yarn.jars=local:/usr/local/spark-2.0.2-bin-hadoop2.6/yarn/*
> spark.eventLog.dir=hdfs://localhost:9000/user/spark/applicationHistory
> spark.eventLog.enabled=true
> 2. yarn-site.xml
> <property>
>   <name>yarn.application.classpath</name>
>   <value>
> /usr/local/hadoop-2.6.0/etc/hadoop,
> /usr/local/hadoop-2.6.0/,
> /usr/local/hadoop-2.6.0/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/tools/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/lib/*,
> /usr/local/spark-2.0.2-bin-hadoop2.6/jars/spark-yarn_2.11-2.0.2.jar
>   </value>
> </property>
> Error in the log:
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ApplicationMaster
> Error on the terminal:
> diagnostics: Application application_1489673977198_0002 failed 2 times due to 
> AM Container for appattempt_1489673977198_0002_02 exited with exitCode: 1 
> For more detailed output, check the application tracking page: 
> http://bdg-hdp-sparkmaster:8088/proxy/application_1489673977198_0002/ Then 
> click on links to logs of each attempt.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2017-03-22 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222
 ] 

Jorge Machado edited comment on SPARK-5236 at 3/22/17 12:47 PM:


[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema. I'm on 1.6.3.

I traced it down to GeneratePredicate.scala, (r: InternalRow) => p.eval(r)

{noformat}

/**
  * @param schema how each returned row has to look; the value returned from next() must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
                            val schema: StructType,
                            val repositoryHistory: DeviceHistoryRepository,
                            val timeZoneId: String,
                            val tablePartitionInfo: TablePartitionInfo,
                            val from: Long,
                            val to: Long) extends Iterator[InternalRow] {

  private val internalItr: ClosableIterator[TagValue[Double]] =
    repositoryHistory.scanTagValues(from, to, tablePartitionInfo)

  override def hasNext: Boolean = internalItr.hasNext

  override def next(): InternalRow = {
    val tagValue = internalItr.next()
    val instant = ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp),
      ZoneId.of(timeZoneId)).toInstant
    val timestamp = Timestamp.from(instant)

    InternalRow.fromSeq(Array(tagValue.getTimestamp, tagValue.getGuid, tagValue.getGuid, tagValue.getValue))

    // Build a mutable row typed from the schema and fill it field by field.
    val mutableRow = new SpecificMutableRow(schema.fields.map(f => f.dataType))
    for (i <- schema.fields.indices) {
      updateMutableRow(i, tagValue, mutableRow, schema(i))
    }
    mutableRow
  }

  def updateMutableRow(i: Int, tagValue: TagValue[Double], row: SpecificMutableRow, field: StructField): Unit = {
    // #TODO this is ugly.
    field.name match {
      case "Date"     => row.setLong(i, tagValue.getTimestamp.toLong)
      case "Device"   => row.update(i, UTF8String.fromString(tagValue.getGuid))
      case "Tag"      => row.update(i, UTF8String.fromString(tagValue.getTagName))
      case "TagValue" => row.setDouble(i, tagValue.getValue)
    }
  }

  override def toString(): String = {
    "Iterator for Region Name " + tablePartitionInfo.getRegionName + " Range:" + from + "until" + "to"
  }
}
{noformat}

Then I get : 

{noformat}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

{noformat}


was (Author: jomach):
[~marmbrus] Hi Michael, so I'm experience the same issue. I'm building a 
datasource for Hbase with some custom schema. 

{noformat}

/**
  *
  * @param schema this is how the row has to look like. The returned value from 
the next must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
val schema: StructType,
val repositoryHistory: 
DeviceHistoryRepository,
val timeZoneId: String,
 

[jira] [Commented] (SPARK-5236) java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

2017-03-22 Thread Jorge Machado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936222#comment-15936222
 ] 

Jorge Machado commented on SPARK-5236:
--

[~marmbrus] Hi Michael, I'm experiencing the same issue. I'm building a 
data source for HBase with a custom schema.

{noformat}

/**
  * @param schema how each returned row has to look; the value returned from next() must match this schema
  * @param hBaseRelation
  * @param repositoryHistory
  * @param timeZoneId
  * @param tablePartitionInfo
  * @param from
  * @param to
  */
class TagValueSparkIterator(val hBaseRelation: HBaseRelation,
                            val schema: StructType,
                            val repositoryHistory: DeviceHistoryRepository,
                            val timeZoneId: String,
                            val tablePartitionInfo: TablePartitionInfo,
                            val from: Long,
                            val to: Long) extends Iterator[InternalRow] {

  private val internalItr: ClosableIterator[TagValue[Double]] =
    repositoryHistory.scanTagValues(from, to, tablePartitionInfo)

  override def hasNext: Boolean = internalItr.hasNext

  override def next(): InternalRow = {
    val tagValue = internalItr.next()
    val instant = ZonedDateTime.ofInstant(Instant.ofEpochSecond(tagValue.getTimestamp),
      ZoneId.of(timeZoneId)).toInstant
    val timestamp = Timestamp.from(instant)

    InternalRow.fromSeq(Array(tagValue.getTimestamp, tagValue.getGuid, tagValue.getGuid, tagValue.getValue))

    // Build a mutable row typed from the schema and fill it field by field.
    val mutableRow = new SpecificMutableRow(schema.fields.map(f => f.dataType))
    for (i <- schema.fields.indices) {
      updateMutableRow(i, tagValue, mutableRow, schema(i))
    }
    mutableRow
  }

  def updateMutableRow(i: Int, tagValue: TagValue[Double], row: SpecificMutableRow, field: StructField): Unit = {
    // #TODO this is ugly.
    field.name match {
      case "Date"     => row.setLong(i, tagValue.getTimestamp.toLong)
      case "Device"   => row.update(i, UTF8String.fromString(tagValue.getGuid))
      case "Tag"      => row.update(i, UTF8String.fromString(tagValue.getTagName))
      case "TagValue" => row.setDouble(i, tagValue.getValue)
    }
  }

  override def toString(): String = {
    "Iterator for Region Name " + tablePartitionInfo.getRegionName + " Range:" + from + "until" + "to"
  }
}
{noformat}

Then I get : 

{noformat}
Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.getLong(SpecificMutableRow.scala:301)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:68)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:74)
at 
org.apache.spark.sql.execution.Filter$$anonfun$2$$anonfun$apply$2.apply(basicOperators.scala:72)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

{noformat}

> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.MutableAny cannot be cast to 
> org.apache.spark.sql.catalyst.expressions.MutableInt
> -
>
> Key: SPARK-5236
> URL: https://issues.apache.org/jira/browse/SPARK-5236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Alex Baretta
>
> {code}
> 15/01/14 05:39:27 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 18.0 
> (TID 28, localhost): parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block 0 in file gs://pa-truven/20141205/parquet/P/part-r-1.parquet
> at 
> 

[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory

2017-03-22 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936208#comment-15936208
 ] 

Yan Facai (颜发才) commented on SPARK-3728:


RandomForest already uses a stack to save nodes, as [~jgfidelis] said before. 
However, all trees are still kept in memory; see `topNodes`.

Perhaps writing trees to disk is still needed if too many trees are trained.
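
As a rough illustration of that direction (hypothetical helper, not MLlib's API), each finished tree could be serialized to disk and only its path kept in memory:
{code}
import java.io.{FileOutputStream, ObjectOutputStream}

// Hypothetical per-tree loop: train one tree at a time (depth-first / FILO),
// persist it, and drop the in-memory reference instead of keeping every topNode.
def trainAndSpill(numTrees: Int, trainOneTree: Int => Serializable, dir: String): Seq[String] = {
  (0 until numTrees).map { i =>
    val tree = trainOneTree(i)
    val path = s"$dir/tree-$i.bin"
    val out = new ObjectOutputStream(new FileOutputStream(path))
    try out.writeObject(tree) finally out.close()
    path  // keep only the on-disk location, not the tree itself
  }
}
{code}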

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19992) spark-submit on deployment-mode cluster

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19992.
---
Resolution: Not A Problem

[~jerryshao] is "/*" even going to work?
This must be an env problem because YARN cluster mode certainly works, and it's 
not clear the env is correct here. Here the Spark JAR is also being submitted 
with --jars, and it's not even a matching version (though I doubt that's the 
problem).

> spark-submit on deployment-mode cluster
> ---
>
> Key: SPARK-19992
> URL: https://issues.apache.org/jira/browse/SPARK-19992
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.2
> Environment: spark version 2.0.2
> hadoop version 2.6.0
>Reporter: narendra maru
>
> spark version 2.0.2
> hadoop version 2.6.0
> spark-submit command:
> "spark-submit --class spark.mongohadoop.testing3 --master yarn --deploy-mode 
> cluster --jars /home/ec2-user/jars/hgmongonew.jar, 
> /home/ec2-user/jars/mongo-hadoop-spark-2.0.1.jar"
> after adding the following in:
> 1. spark-defaults.conf
> spark.executor.extraJavaOptions -Dconfig.fuction.conf 
> spark.yarn.jars=local:/usr/local/spark-2.0.2-bin-hadoop2.6/yarn/*
> spark.eventLog.dir=hdfs://localhost:9000/user/spark/applicationHistory
> spark.eventLog.enabled=true
> 2. yarn-site.xml
> <property>
>   <name>yarn.application.classpath</name>
>   <value>
> /usr/local/hadoop-2.6.0/etc/hadoop,
> /usr/local/hadoop-2.6.0/,
> /usr/local/hadoop-2.6.0/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/,
> /usr/local/hadoop-2.6.0/share/hadoop/common/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/,
> /usr/local/hadoop-2.6.0/share/hadoop/hdfs/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/,
> /usr/local/hadoop-2.6.0/share/hadoop/mapreduce/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/tools/lib/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/,
> /usr/local/hadoop-2.6.0/share/hadoop/yarn/lib/*,
> /usr/local/spark-2.0.2-bin-hadoop2.6/jars/spark-yarn_2.11-2.0.2.jar
>   </value>
> </property>
> Error in the log:
> Error: Could not find or load main class 
> org.apache.spark.deploy.yarn.ApplicationMaster
> Error on the terminal:
> diagnostics: Application application_1489673977198_0002 failed 2 times due to 
> AM Container for appattempt_1489673977198_0002_02 exited with exitCode: 1 
> For more detailed output, check the application tracking page: 
> http://bdg-hdp-sparkmaster:8088/proxy/application_1489673977198_0002/ Then 
> click on links to logs of each attempt.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19934) code comments are not very clear in BlackListTracker.scala

2017-03-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19934.
---
Resolution: Not A Problem

> code comments are not very clear in BlackListTracker.scala
> 
>
> Key: SPARK-19934
> URL: https://issues.apache.org/jira/browse/SPARK-19934
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Priority: Trivial
>
> {code}
> def handleRemovedExecutor(executorId: String): Unit = {
> // We intentionally do not clean up executors that are already 
> blacklisted in
> // nodeToBlacklistedExecs, so that if another executor on the same node 
> gets blacklisted, we can
> // blacklist the entire node.  We also can't clean up 
> executorIdToBlacklistStatus, so we can
> // eventually remove the executor after the timeout.  Despite not 
> clearing those structures
> // here, we don't expect they will grow too big since you won't get too 
> many executors on one
> // node, and the timeout will clear it up periodically in any case.
> executorIdToFailureList -= executorId
>   }
> {code}
> I think the comments should be:
> {code}
> // We intentionally do not clean up executors that are already blacklisted in
> // nodeToBlacklistedExecs, so that if 
> {spark.blacklist.application.maxFailedExecutorsPerNode} - 1 executor on the 
> same node gets blacklisted, we can
> // blacklist the entire node.
> {code}
> Reference from the design doc 
> https://docs.google.com/document/d/1R2CVKctUZG9xwD67jkRdhBR4sCgccPR2dhTYSRXFEmg/edit.
> When considering adding a node to the application-level blacklist, the rule is:
> Nodes are placed into a blacklist for the entire application when the number 
> of blacklisted executors goes over 
> spark.blacklist.application.maxFailedExecutorsPerNode (default 2),
> but the existing comment only explains the default value.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


