[jira] [Assigned] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21072:


Assignee: (was: Apache Spark)

> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code should check whether each argument is actually a child node.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047431#comment-16047431
 ] 

Apache Spark commented on SPARK-21072:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18284

> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code should check whether each argument is actually a child node.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21072:


Assignee: Apache Spark

> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>Assignee: Apache Spark
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code should check whether each argument is actually a child node.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread coneyliu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

coneyliu updated SPARK-21072:
-
Description: 
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.

```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```

  was:
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```


> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
> should check whether each argument is actually a child node.
> ```
> case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
>   val newChild1 = f(arg1.asInstanceOf[BaseType])
>   val newChild2 = f(arg2.asInstanceOf[BaseType])
>   if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
>     changed = true
>     (newChild1, newChild2)
>   } else {
>     tuple
>   }
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread coneyliu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

coneyliu updated SPARK-21072:
-
Description: 
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code should check whether each argument is actually a child node.

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]

  was:
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.

```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```


> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code should check whether each argument is actually a child node.
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread coneyliu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

coneyliu updated SPARK-21072:
-
Description: 
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.
```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```

  was:
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.

```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```


> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
> should check whether each argument is actually a child node.
> ```
> case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
>   val newChild1 = f(arg1.asInstanceOf[BaseType])
>   val newChild2 = f(arg2.asInstanceOf[BaseType])
>   if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
>     changed = true
>     (newChild1, newChild2)
>   } else {
>     tuple
>   }
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread coneyliu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

coneyliu updated SPARK-21072:
-
Description: 
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.

```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```

  was:
As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.

```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```


> `TreeNode.mapChildren` should only apply to the children node. 
> ---
>
> Key: SPARK-21072
> URL: https://issues.apache.org/jira/browse/SPARK-21072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: coneyliu
>
> As the function name and the comments of `TreeNode.mapChildren` indicate, the 
> function should only be applied to the current node's children. Therefore, the 
> following code 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
> should check whether each argument is actually a child node.
> ```
> case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
>   val newChild1 = f(arg1.asInstanceOf[BaseType])
>   val newChild2 = f(arg2.asInstanceOf[BaseType])
>   if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
>     changed = true
>     (newChild1, newChild2)
>   } else {
>     tuple
>   }
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21072) `TreeNode.mapChildren` should only apply to the children node.

2017-06-12 Thread coneyliu (JIRA)
coneyliu created SPARK-21072:


 Summary: `TreeNode.mapChildren` should only apply to the children 
node. 
 Key: SPARK-21072
 URL: https://issues.apache.org/jira/browse/SPARK-21072
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: coneyliu


As the function name and the comments of `TreeNode.mapChildren` indicate, the 
function should only be applied to the current node's children. Therefore, the 
following code 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L342]
should check whether each argument is actually a child node.

```
case tuple@(arg1: TreeNode[_], arg2: TreeNode[_]) =>
  val newChild1 = f(arg1.asInstanceOf[BaseType])
  val newChild2 = f(arg2.asInstanceOf[BaseType])
  if (!(newChild1 fastEquals arg1) || !(newChild2 fastEquals arg2)) {
    changed = true
    (newChild1, newChild2)
  } else {
    tuple
  }
```
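
Below is a minimal, self-contained sketch (hypothetical names, not Spark's actual code) of the kind of guard the description asks for: apply `f` to a tuple element only when it really is one of the current node's children, assuming the node keeps a set of its children for that test (Catalyst's `TreeNode` has a similar `containsChild` set).

```
// Hypothetical, simplified illustration -- not the real TreeNode.
case class Node(name: String, children: Seq[Node]) {
  // A set of direct children, used to test whether an argument is a child.
  lazy val containsChild: Set[Node] = children.toSet

  // Apply `f` only to arguments that are actual children; leave anything else alone.
  def mapIfChild(arg: Node, f: Node => Node): Node =
    if (containsChild(arg)) f(arg) else arg
}

object ChildGuardExample extends App {
  val child  = Node("child", Nil)
  val other  = Node("not-a-child", Nil)
  val parent = Node("parent", Seq(child))
  val f: Node => Node = n => n.copy(name = n.name + "-mapped")

  println(parent.mapIfChild(child, f).name) // child-mapped
  println(parent.mapIfChild(other, f).name) // not-a-child (unchanged)
}
```

Applied to the snippet above, the same check would wrap the `f(arg1.asInstanceOf[BaseType])` and `f(arg2.asInstanceOf[BaseType])` calls so that tuple elements that are not children of the current node pass through unchanged.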



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-06-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19910.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> `stack` should not reject NULL values due to type mismatch
> --
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.3.0
>
>
> Since the `stack` function generates a table with nullable columns, it should 
> allow mixed NULL values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
> due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
> line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}
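
For illustration only, here is a tiny self-contained sketch of the relaxed check the description argues for, using made-up type names rather than Spark's `DataType` hierarchy: a NULL argument is treated as compatible with any other argument type, since the generated columns are nullable anyway.

{code}
// Hypothetical sketch -- not Spark's analyzer code.
sealed trait SqlType
case object IntType  extends SqlType
case object NullType extends SqlType

object StackNullCheckSketch extends App {
  // NULL should not trigger a "type mismatch"; only genuinely different
  // non-null types should.
  def compatible(a: SqlType, b: SqlType): Boolean =
    a == NullType || b == NullType || a == b

  println(compatible(IntType, IntType))  // true
  println(compatible(IntType, NullType)) // true: mixed NULL values are allowed
}
{code}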



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21045) Spark executor blocked instead of throwing exception because exception occur when python worker send exception info to Java Gateway in Python 2+

2017-06-12 Thread Joshuawangzj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshuawangzj updated SPARK-21045:
-
Environment: 
The problem occurs only in Python 2.
Python 3 is fine.

Summary: Spark executor blocked instead of throwing exception because 
exception occur when python worker send exception info to Java Gateway in 
Python 2+  (was: Spark executor blocked instead of throwing exception because 
exception occur when python worker send exception info to Java Gateway)

> Spark executor blocked instead of throwing exception because exception occur 
> when python worker send exception info to Java Gateway in Python 2+
> 
>
> Key: SPARK-21045
> URL: https://issues.apache.org/jira/browse/SPARK-21045
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.1, 2.0.2, 2.1.1
> Environment: The problem occurs only in Python 2.
> Python 3 is fine.
>Reporter: Joshuawangzj
>
> My PySpark program always blocks in our production YARN cluster. I ran jstack 
> and found:
> {code}
> "Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
> tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
>java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> - locked <0x0007acab1c98> (a java.io.BufferedInputStream)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
> at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
> at 
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It is blocked on a socket read. I viewed the log on the blocked executor and 
> found this error:
> {code}
> Traceback (most recent call last):
>   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in 
> main
> write_with_length(traceback.format_exc().encode("utf-8"), outfile)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: 
> ordinal not in range(128)
> {code}
> Finally I found the problem:
> {code:title=worker.py|borderStyle=solid}
> # 178 line in spark 2.1.1
> except Exception:
>     try:
>         write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
>         write_with_length(traceback.format_exc().encode("utf-8"), outfile)
>     except IOError:
>         # JVM close the socket
>         pass
>     except Exception:
>         # Write the error to stderr if it happened while serializing
>         print("PySpark worker failed with exception:", file=sys.stderr)
>         print(traceback.format_exc(), file=sys.stderr)
> {code}
> When write_with_length(traceback.format_exc().encode("utf-8"), outfile) raises 
> an exception such as UnicodeDecodeError, the Python worker cannot send the 
> traceback, but once PythonRDD has received PYTHON_EXCEPTION_THROWN it expects 
> to read the traceback length next. So it blocks.
> {code:title=PythonRDD.scala|borderStyle=solid}
> // line 190 in Spark 2.1.1
> case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
>  // Signals that an exception has been thrown in python
>  val exLength = stream.readInt()  // It is possible to be blocked
> {code}
> {color:red}
> We can trigger the bug with a simple program:
> {color}
> {code:title=test.py|borderStyle=solid}
> spark = SparkSession.builder.master('local').getOrCreate()
> rdd = spark.sparkContext.parallelize(['中']).map(lambda x: 
> x.encode("utf8"))
> rdd.collect()
> {code}



--
This message was sent by Atlassian JIRA

[jira] [Commented] (SPARK-20988) Convert logistic regression to new aggregator framework

2017-06-12 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047370#comment-16047370
 ] 

Vincent commented on SPARK-20988:
-

I can work on this if no one is working on it now :)

> Convert logistic regression to new aggregator framework
> ---
>
> Key: SPARK-20988
> URL: https://issues.apache.org/jira/browse/SPARK-20988
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Use the hierarchy from SPARK-19762 for logistic regression optimization



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21068) SparkR error message when passed an R object rather than Java object could be more informative

2017-06-12 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047338#comment-16047338
 ] 

Felix Cheung commented on SPARK-21068:
--

surely. what brings you to R land? :)

> SparkR error message when passed an R object rather than Java object could be 
> more informative
> --
>
> Key: SPARK-21068
> URL: https://issues.apache.org/jira/browse/SPARK-21068
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0, 2.3.0
>Reporter: holdenk
>Priority: Trivial
>
> When SparkR is passed a non-Java object where a Java object is expected, the 
> backend code fails with the error message `Error: clas(objId) == "jobj" is 
> not TRUE`, which could be more informative.
> See backend.R and the isInstanceOf function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21028) Parallel One vs. Rest Classifier Scala

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047258#comment-16047258
 ] 

Apache Spark commented on SPARK-21028:
--

User 'ajaysaini725' has created a pull request for this issue:
https://github.com/apache/spark/pull/18281

> Parallel One vs. Rest Classifier Scala
> --
>
> Key: SPARK-21028
> URL: https://issues.apache.org/jira/browse/SPARK-21028
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Right now the Scala implementation of the one vs. rest algorithm allows for 
> parallelism but does not allow for the amount of parallelism to be tuned. 
> Adding a tunable parameter for the number of jobs to run in parallel at a 
> time would be useful because it would allow the user to adjust the level of 
> parallelism to be optimal for their task.
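
As a rough, hypothetical sketch of what a tunable parallelism parameter could look like (this is not the actual OneVsRest code), the number of per-class training jobs in flight can be bounded with a fixed-size thread pool:

{code}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelOneVsRestSketch extends App {
  // Hypothetical: train one binary "model" per class, with at most
  // `parallelism` trainings running at the same time.
  def trainAll(numClasses: Int, parallelism: Int)(trainOne: Int => String): Seq[String] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = (0 until numClasses).map(c => Future(trainOne(c)))
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally pool.shutdown()
  }

  val models = trainAll(numClasses = 4, parallelism = 2)(c => s"model-for-class-$c")
  println(models.mkString(", "))
}
{code}

Setting `parallelism` to 1 gives sequential training, while a large value approaches fully parallel behaviour, so such a parameter would let users pick the level that suits their task.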



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21027:
--
Shepherd: Joseph K. Bradley

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21071) remove append APIs and simplify array writing logic

2017-06-12 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21071:
---

 Summary: remove append APIs and simplify array writing logic
 Key: SPARK-21071
 URL: https://issues.apache.org/jira/browse/SPARK-21071
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14450) Python OneVsRest should train multiple models at once

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14450.
-
Resolution: Duplicate

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047245#comment-16047245
 ] 

Joseph K. Bradley commented on SPARK-14450:
---

See linked JIRA for new issue.

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047244#comment-16047244
 ] 

Joseph K. Bradley commented on SPARK-14450:
---

Scala already has parallelization.  I just rediscovered this issue...so I'll 
close this old copy.

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047243#comment-16047243
 ] 

Joseph K. Bradley commented on SPARK-21027:
---

Copying from [ML-14450]:

[SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
related to using existing libraries like {{multiprocessing}}, we are not 
training multiple models in parallel initially.

This issue is for prototyping, testing, and implementing a way to train 
multiple models at once.  Speaking with [~joshrosen], a good option might be 
the concurrent.futures package:
* Python 3.x: 
[https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
* Python 2.x: [https://pypi.python.org/pypi/futures]

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047243#comment-16047243
 ] 

Joseph K. Bradley edited comment on SPARK-21027 at 6/12/17 11:54 PM:
-

Copying from [SPARK-14450]:

[SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
related to using existing libraries like {{multiprocessing}}, we are not 
training multiple models in parallel initially.

This issue is for prototyping, testing, and implementing a way to train 
multiple models at once.  Speaking with [~joshrosen], a good option might be 
the concurrent.futures package:
* Python 3.x: 
[https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
* Python 2.x: [https://pypi.python.org/pypi/futures]


was (Author: josephkb):
Copying from [ML-14450]:

[SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
related to using existing libraries like {{multiprocessing}}, we are not 
training multiple models in parallel initially.

This issue is for prototyping, testing, and implementing a way to train 
multiple models at once.  Speaking with [~joshrosen], a good option might be 
the concurrent.futures package:
* Python 3.x: 
[https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
* Python 2.x: [https://pypi.python.org/pypi/futures]

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047241#comment-16047241
 ] 

Joseph K. Bradley commented on SPARK-21027:
---

Whoops! I realized I'd reported this long ago...and I'd said there might be 
issues with python multiprocessing.  I don't recall what those issues were.  
Let's investigate.

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-12 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047233#comment-16047233
 ] 

Dayou Zhou commented on SPARK-18294:


Thanks for responding.  My colleague Aarati Khobare will provide you with the 
details.

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21047) Add test suites for complicated cases in ColumnarBatchSuite

2017-06-12 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-21047:
-
Summary: Add test suites for complicated cases in ColumnarBatchSuite  (was: 
Add test suites for complicated cases)

> Add test suites for complicated cases in ColumnarBatchSuite
> ---
>
> Key: SPARK-21047
> URL: https://issues.apache.org/jira/browse/SPARK-21047
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>
> The current {{ColumnarBatchSuite}} has only very simple test cases for arrays. 
> This JIRA will add test suites for more complicated cases, such as nested 
> arrays in {{ColumnVector}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-12 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047229#comment-16047229
 ] 

Jiang Xingbo commented on SPARK-18294:
--

This is actually legacy code refactoring; it shouldn't affect common use cases 
because the old code is still valid. Could you expand on why you need this?

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-12 Thread Dayou Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047196#comment-16047196
 ] 

Dayou Zhou commented on SPARK-18294:


Hi [~jiangxb1987][~jiangxb], 
Thank you for making this fix -- we really need it. Just checking on the status: 
when do you expect to merge it into master?

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we should 
> implement a `HadoopMapRedCommitProtocol` that supports the older `mapred` 
> package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21070:


Assignee: Apache Spark

> Pick up cloudpickle upgrades from cloudpickle python module
> ---
>
> Key: SPARK-21070
> URL: https://issues.apache.org/jira/browse/SPARK-21070
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.1.1
>Reporter: Kyle Kelley
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21070:


Assignee: (was: Apache Spark)

> Pick up cloudpickle upgrades from cloudpickle python module
> ---
>
> Key: SPARK-21070
> URL: https://issues.apache.org/jira/browse/SPARK-21070
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.1.1
>Reporter: Kyle Kelley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047153#comment-16047153
 ] 

Apache Spark commented on SPARK-21070:
--

User 'rgbkrk' has created a pull request for this issue:
https://github.com/apache/spark/pull/18282

> Pick up cloudpickle upgrades from cloudpickle python module
> ---
>
> Key: SPARK-21070
> URL: https://issues.apache.org/jira/browse/SPARK-21070
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0, 2.1.1
>Reporter: Kyle Kelley
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21070) Pick up cloudpickle upgrades from cloudpickle python module

2017-06-12 Thread Kyle Kelley (JIRA)
Kyle Kelley created SPARK-21070:
---

 Summary: Pick up cloudpickle upgrades from cloudpickle python 
module
 Key: SPARK-21070
 URL: https://issues.apache.org/jira/browse/SPARK-21070
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.1.1, 2.1.0, 2.0.0
Reporter: Kyle Kelley
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21069) Add rate source to programming guide

2017-06-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21069:
-
Labels: starter  (was: )

> Add rate source to programming guide
> 
>
> Key: SPARK-21069
> URL: https://issues.apache.org/jira/browse/SPARK-21069
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>  Labels: starter
>
> SPARK-20979 added a new structured streaming source: rate source. We should 
> document it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20979) Add a rate source to generate values for tests and benchmark

2017-06-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-20979:
-
Affects Version/s: (was: 2.2.0)
   2.3.0

> Add a rate source to generate values for tests and benchmark
> 
>
> Key: SPARK-20979
> URL: https://issues.apache.org/jira/browse/SPARK-20979
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047127#comment-16047127
 ] 

Apache Spark commented on SPARK-21027:
--

User 'ajaysaini725' has created a pull request for this issue:
https://github.com/apache/spark/pull/18281

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21027:


Assignee: (was: Apache Spark)

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21027:


Assignee: Apache Spark

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>Assignee: Apache Spark
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21069) Add rate source to programming guide

2017-06-12 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-21069:


 Summary: Add rate source to programming guide
 Key: SPARK-21069
 URL: https://issues.apache.org/jira/browse/SPARK-21069
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Structured Streaming
Affects Versions: 2.3.0
Reporter: Shixiong Zhu


SPARK-20979 added a new structured streaming source: rate source. We should 
document it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20979) Add a rate source to generate values for tests and benchmark

2017-06-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20979.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add a rate source to generate values for tests and benchmark
> 
>
> Key: SPARK-20979
> URL: https://issues.apache.org/jira/browse/SPARK-20979
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20379) Allow setting SSL-related passwords through env variables

2017-06-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-20379:


I'll re-open this since the SSL options don't really use the code path that can 
reference env variables in config values, so this is still something we should 
add. The fix will be different from my previous PR, though.

> Allow setting SSL-related passwords through env variables
> -
>
> Key: SPARK-20379
> URL: https://issues.apache.org/jira/browse/SPARK-20379
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> Currently, Spark reads all SSL options from configuration, which can be 
> provided in a file or through the command line. This means that to set the 
> SSL keystore / trust store / key passwords, you have to use one of those 
> options.
> Using the command line for that is not secure, and in some environments 
> admins prefer to not have the password written in plain text in a file (since 
> the file and the data it's protecting could be stored in the same 
> filesystem). So for these cases it would be nice to be able to provide these 
> passwords through environment variables, which are not written to disk and 
> also not readable by other users on the same machine.
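
A minimal, hypothetical sketch of the idea (the helper and the variable name are illustrative, not Spark's actual configuration API): look for the secret in an environment variable first and only fall back to the configured plain-text value.

{code}
object SslPasswordSketch extends App {
  // Hypothetical helper: prefer an environment variable, which is neither written
  // to disk nor readable by other users, over a value from a config file / CLI.
  def resolvePassword(envVar: String, configured: Option[String]): Option[String] =
    sys.env.get(envVar).orElse(configured)

  // e.g. `export SPARK_SSL_KEYSTORE_PASSWORD=...` before launching the process.
  val keystorePassword =
    resolvePassword("SPARK_SSL_KEYSTORE_PASSWORD", configured = None)
  println(keystorePassword.isDefined)
}
{code}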



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21068) SparkR error message when passed an R object rather than Java object could be more informative

2017-06-12 Thread holdenk (JIRA)
holdenk created SPARK-21068:
---

 Summary: SparkR error message when passed an R object rather than 
Java object could be more informative
 Key: SPARK-21068
 URL: https://issues.apache.org/jira/browse/SPARK-21068
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0, 2.3.0
Reporter: holdenk
Priority: Trivial


When SparkR is passed a non-Java object where a Java object is expected, the 
backend code fails with the error message `Error: clas(objId) == "jobj" is not 
TRUE`, which could be more informative.

See backend.R and the isInstanceOf function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21050) ml word2vec write has overflow issue in calculating numPartitions

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21050.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 18265
[https://github.com/apache/spark/pull/18265]

> ml word2vec write has overflow issue in calculating numPartitions
> -
>
> Key: SPARK-21050
> URL: https://issues.apache.org/jira/browse/SPARK-21050
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 2.2.1, 2.3.0
>
>
> The method calculateNumberOfPartitions() uses Int, not Long (unlike the MLlib 
> version), so the calculation of the number of partitions for ML persistence 
> can very easily overflow.
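
As a small illustration of the overflow (the sizes below are made up; the point is that the intermediate product needs to be computed in Long, as the MLlib version does):

{code}
object NumPartitionsOverflowSketch extends App {
  // Made-up sizes: a large vocabulary times a wide vector times 4 bytes per value.
  val vectorSize = 1000
  val vocabSize  = 10 * 1000 * 1000

  val bytesAsInt  = vectorSize * vocabSize * 4          // Int arithmetic wraps around
  val bytesAsLong = vectorSize.toLong * vocabSize * 4L  // Long arithmetic is correct

  println(bytesAsInt)   // 1345294336 -- wrapped, far too small
  println(bytesAsLong)  // 40000000000
}
{code}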



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20499) Spark MLlib, GraphX 2.2 QA umbrella

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20499:
--
Fix Version/s: 2.2.0

> Spark MLlib, GraphX 2.2 QA umbrella
> ---
>
> Key: SPARK-20499
> URL: https://issues.apache.org/jira/browse/SPARK-20499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX.   *SparkR is separate: [SPARK-20508].*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20507) Update MLlib, GraphX websites for 2.2

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20507:
--
Fix Version/s: 2.2.0

> Update MLlib, GraphX websites for 2.2
> -
>
> Key: SPARK-20507
> URL: https://issues.apache.org/jira/browse/SPARK-20507
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>Priority: Critical
> Fix For: 2.2.0
>
>
> Update the sub-projects' websites to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21059) LikeSimplification can NPE on null pattern

2017-06-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21059.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> LikeSimplification can NPE on null pattern
> --
>
> Key: SPARK-21059
> URL: https://issues.apache.org/jira/browse/SPARK-21059
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20345) Fix STS error handling logic on HiveSQLException

2017-06-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-20345.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0
   2.2.1

> Fix STS error handling logic on HiveSQLException
> 
>
> Key: SPARK-20345
> URL: https://issues.apache.org/jira/browse/SPARK-20345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.2.1, 2.3.0
>
>
> [SPARK-5100|https://github.com/apache/spark/commit/343d3bfafd449a0371feb6a88f78e07302fa7143]
>  added Spark Thrift Server UI and the following logic to handle exceptions on 
> case `Throwable`.
> {code}
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> {code}
> However, a case was missed after 
> [SPARK-6964|https://github.com/apache/spark/commit/eb19d3f75cbd002f7e72ce02017a8de67f562792]
>  (`Support Cancellation in the Thrift Server`) added case 
> `HiveSQLException` before case `Throwable`.
> Logically, we should also call `HiveThriftServer2.listener.onStatementError` in 
> the `HiveSQLException` case.
> {code}
>   case e: HiveSQLException =>
> if (getStatus().getState() == OperationState.CANCELED) {
>   return
> } else {
>   setState(OperationState.ERROR)
>   throw e
> }
>   // Actually do need to catch Throwable as some failures don't inherit 
> from Exception and
>   // HiveServer will silently swallow them.
>   case e: Throwable =>
> val currentState = getStatus().getState()
> logError(s"Error executing query, currentState $currentState, ", e)
> setState(OperationState.ERROR)
> HiveThriftServer2.listener.onStatementError(
>   statementId, e.getMessage, SparkUtils.exceptionString(e))
> throw new HiveSQLException(e.toString)
> {code}
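
For illustration, a sketch of what the suggested change could look like, based only on the snippet quoted above (not necessarily the patch that was merged):

{code}
case e: HiveSQLException =>
  if (getStatus().getState() == OperationState.CANCELED) {
    return
  } else {
    setState(OperationState.ERROR)
    // Mirror the Throwable branch so the STS UI also records statements that
    // fail with a HiveSQLException (sketch based on the snippet above).
    HiveThriftServer2.listener.onStatementError(
      statementId, e.getMessage, SparkUtils.exceptionString(e))
    throw e
  }
{code}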



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-12 Thread Dominic Ricard (JIRA)
Dominic Ricard created SPARK-21067:
--

 Summary: Thrift Server - CTAS fail with Unable to move source
 Key: SPARK-21067
 URL: https://issues.apache.org/jira/browse/SPARK-21067
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
 Environment: Yarn
Hive MetaStore
HDFS (HA)
Reporter: Dominic Ricard


After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
would fail, sometimes...

Most of the time, the CTAS would work only once after starting the thrift 
server. After that, dropping the table and re-issuing the same CTAS would fail 
with the following message (sometimes it fails right away, sometimes it works 
for a long period of time):

{noformat}
Error: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
 to destination 
hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
(state=,code=0)
{noformat}

We have already found the following JIRA 
(https://issues.apache.org/jira/browse/SPARK-11021), which states that 
{{hive.exec.stagingdir}} has to be set in order for Spark to be able to 
handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
ours set to "/tmp/hive-staging/\{user.name\}"

Same issue with INSERT statements:
{noformat}
CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
dricard.test SELECT 1;
Error: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
 to destination 
hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
(state=,code=0)
{noformat}

This worked fine in 1.6.2, which we currently run in our production 
environment, but since 2.0+ we haven't been able to CREATE TABLE consistently 
on the cluster.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047045#comment-16047045
 ] 

Zhenhua Wang commented on SPARK-17642:
--

[~mbasmanova] I've reopened and rebased the above PR. IMO, security should not 
be a blocking issue for the PR, since we currently don't have a security 
mechanism inside Spark. Let's see others' comments.

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2017-06-12 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-17914.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 18252
[https://github.com/apache/spark/pull/18252]

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>Assignee: Anton Okolnychyi
> Fix For: 2.2.1, 2.3.0
>
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to happen in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically, or throwing an exception prompting users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.
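
For example, the pre-formatting option could be as simple as truncating the fractional part to six digits before the cast; the helper below is purely illustrative and not part of Spark:

{code}
// Illustrative only: truncate an ISO-8601 timestamp's fractional seconds to microseconds.
def truncateToMicros(ts: String): String = {
  val WithFraction = """(.*\.)(\d+)(Z?)""".r
  ts match {
    case WithFraction(prefix, fraction, zone) => prefix + fraction.take(6) + zone
    case _ => ts                                  // no fractional part: leave unchanged
  }
}

truncateToMicros("2016-05-14T15:12:14.000345678Z")  // "2016-05-14T15:12:14.000345Z"
{code}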



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17914) Spark SQL casting to TimestampType with nanosecond results in incorrect timestamp

2017-06-12 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-17914:
-

Assignee: Anton Okolnychyi

> Spark SQL casting to TimestampType with nanosecond results in incorrect 
> timestamp
> -
>
> Key: SPARK-17914
> URL: https://issues.apache.org/jira/browse/SPARK-17914
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Oksana Romankova
>Assignee: Anton Okolnychyi
>
> In some cases when timestamps contain nanoseconds they will be parsed 
> incorrectly. 
> Examples: 
> "2016-05-14T15:12:14.0034567Z" -> "2016-05-14 15:12:14.034567"
> "2016-05-14T15:12:14.000345678Z" -> "2016-05-14 15:12:14.345678"
> The issue seems to happen in DateTimeUtils.stringToTimestamp(). It 
> assumes that only a 6-digit fraction of a second will be passed.
> With this being the case, I would suggest either discarding nanoseconds 
> automatically, or throwing an exception prompting users to pre-format 
> timestamps to microsecond precision before casting to Timestamp.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict

2017-06-12 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-21065:
---
Description: 
My streaming application has 200+ output operations, many of them stateful and 
several of them windowed. In an attempt to reduce the processing times, I set 
"spark.streaming.concurrentJobs" to 2+. Initial results are very positive, 
cutting our processing time from ~3 minutes to ~1 minute, but eventually we 
encounter an exception as follows:

(Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
batch from 45 minutes before the exception is thrown.)

2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
StreamingJobProgressListener threw an exception
java.util.NoSuchElementException: key not found 149697756 ms
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
...


The Spark code causing the exception is here:

https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125
  override def onOutputOperationCompleted(
  outputOperationCompleted: StreamingListenerOutputOperationCompleted): 
Unit = synchronized {
// This method is called before onBatchCompleted
{color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color}
  updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
}

It seems to me that it may be caused by that batch being removed earlier.
https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102

  override def onBatchCompleted(batchCompleted: 
StreamingListenerBatchCompleted): Unit = {
synchronized {
  waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime)
  
{color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color}
  val batchUIData = BatchUIData(batchCompleted.batchInfo)
  completedBatchUIData.enqueue(batchUIData)
  if (completedBatchUIData.size > batchUIDataLimit) {
val removedBatch = completedBatchUIData.dequeue()
batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime)
  }
  totalCompletedBatches += 1L

  totalProcessedRecords += batchUIData.numRecords
}
}

What is the solution here? Should I make my spark streaming context remember 
duration a lot longer? ssc.remember(batchDuration * rememberMultiple)

Otherwise, it seems like there should be some kind of existence check on 
runningBatchUIData before dereferencing it.


  was:
My streaming application has 200+ output operations, many of them stateful and 
several of them windowed. In an attempt to reduce the processing times, I set 
"spark.streaming.concurrentJobs" to 2+. Initial results are very positive, 
cutting our processing time from ~3 minutes to ~1 minute, but eventually we 
encounter an exception as follows:

Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
batch from 45 minutes before the exception is thrown.

2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
StreamingJobProgressListener threw an exception
java.util.NoSuchElementException: key not found 149697756 ms
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
...


The Spark code 

[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-06-12 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046957#comment-16046957
 ] 

Sital Kedia commented on SPARK-18838:
-

[~joshrosen] - The PR for my change to multi-thread the event processor has 
been closed inadvertently as a stale PR. Can you take a look at the approach 
and let me know what you think. Please note that this change is very critical 
for running large jobs in our environment. 
https://github.com/apache/spark/pull/16291

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, such as `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job.  For example, a 
> significant delay in receiving a `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners pay the price 
> of the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for each executor service will be 
> bounded so that we do not grow memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 
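
A minimal sketch of the per-listener dispatch idea described above, assuming a generic event type and a listener represented as a plain function (illustrative only, not the actual Spark implementation):

{code}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Each listener gets its own single-threaded, bounded executor, so a slow listener
// only delays (or drops) its own events; other listeners keep processing in order.
class PerListenerBus[E](listeners: Seq[E => Unit], queueCapacity: Int = 10000) {
  private val executors = listeners.map { _ =>
    new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.DiscardPolicy())   // bounded queue: drop rather than grow memory
  }

  def post(event: E): Unit =
    listeners.zip(executors).foreach { case (listener, executor) =>
      executor.execute(new Runnable { def run(): Unit = listener(event) })
    }

  def stop(): Unit = executors.foreach(_.shutdown())
}
{code}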



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20434) Move Hadoop delegation token code from yarn to core

2017-06-12 Thread Michael Gummelt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Gummelt updated SPARK-20434:

Summary: Move Hadoop delegation token code from yarn to core  (was: Move 
Kerberos delegation token code from yarn to core)

> Move Hadoop delegation token code from yarn to core
> ---
>
> Key: SPARK-20434
> URL: https://issues.apache.org/jira/browse/SPARK-20434
> Project: Spark
>  Issue Type: Task
>  Components: Mesos, Spark Core, YARN
>Affects Versions: 2.1.0
>Reporter: Michael Gummelt
>
> This is to enable kerberos support for other schedulers, such as Mesos.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20511:
-

Assignee: Felix Cheung

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
> Fix For: 2.2.0
>
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18864:
--
Fix Version/s: 2.2.0

> Changes of MLlib and SparkR behavior for 2.2
> 
>
> Key: SPARK-18864
> URL: https://issues.apache.org/jira/browse/SPARK-18864
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
> Fix For: 2.2.0
>
>
> This JIRA is for tracking changes of behavior within MLlib and SparkR for the 
> Spark 2.2 release.  If any JIRAs change behavior, please list them below with 
> a short description of the change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20511:
--
Fix Version/s: 2.2.0

> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
> Fix For: 2.2.0
>
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20508) Spark R 2.2 QA umbrella

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20508:
-

Assignee: Felix Cheung  (was: Joseph K. Bradley)

> Spark R 2.2 QA umbrella
> ---
>
> Key: SPARK-20508
> URL: https://issues.apache.org/jira/browse/SPARK-20508
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-20513) Update SparkR website for 2.2

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley deleted SPARK-20513:
--


> Update SparkR website for 2.2
> -
>
> Key: SPARK-20513
> URL: https://issues.apache.org/jira/browse/SPARK-20513
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-project's website to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20508) Spark R 2.2 QA umbrella

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20508:
--
Fix Version/s: 2.2.0

> Spark R 2.2 QA umbrella
> ---
>
> Key: SPARK-20508
> URL: https://issues.apache.org/jira/browse/SPARK-20508
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.2.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for SparkR.
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
> * Audit new public APIs (from the generated html doc)
> ** relative to Spark Scala/Java APIs
> ** relative to popular R libraries
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
> * Update Programming Guide
> * Update website



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20510:
--
Fix Version/s: 2.2.0

> SparkR 2.2 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-20510
> URL: https://issues.apache.org/jira/browse/SPARK-20510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.2.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20510:
-

Assignee: Felix Cheung

> SparkR 2.2 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-20510
> URL: https://issues.apache.org/jira/browse/SPARK-20510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.2.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20513) Update SparkR website for 2.2

2017-06-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046843#comment-16046843
 ] 

Joseph K. Bradley commented on SPARK-20513:
---

whoops, i'll delete this...

> Update SparkR website for 2.2
> -
>
> Key: SPARK-20513
> URL: https://issues.apache.org/jira/browse/SPARK-20513
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-project's website to include new features in this release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20512:
--
Fix Version/s: 2.2.0

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.2.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-18864].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-20505].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20512) SparkR 2.2 QA: Programming guide, migration guide, vignettes updates

2017-06-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20512:
-

Assignee: Felix Cheung

> SparkR 2.2 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-20512
> URL: https://issues.apache.org/jira/browse/SPARK-20512
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Felix Cheung
>Priority: Critical
> Fix For: 2.2.0
>
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-18864].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-20505].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21066) LibSVM load just one input file

2017-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046812#comment-16046812
 ] 

Sean Owen commented on SPARK-21066:
---

CC [~lian cheng] 

I don't immediately see why the relation can't examine multiple files to 
compute the number of features, either.

> LibSVM load just one input file
> ---
>
> Key: SPARK-21066
> URL: https://issues.apache.org/jira/browse/SPARK-21066
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> Currently, when we use SVM to train on a dataset, we found that the input is 
> limited to only one file.
> A file stored on a distributed file system such as HDFS is split into 
> multiple pieces, and I think this limit is not necessary.
>  We can join input paths into a string separated by commas. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046798#comment-16046798
 ] 

Dongjoon Hyun commented on SPARK-17642:
---

Sorry, I cannot help you further with that because I'm not a Databricks guy.

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Maria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046794#comment-16046794
 ] 

Maria commented on SPARK-17642:
---

[~dongjoon], where can I read more on this? I found blog post [1] announcing 
built-in access control, but no further specifics.

[1] 
https://databricks.com/blog/2017/05/24/databricks-runtime-3-0-beta-delivers-enterprise-grade-apache-spark.html
 

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2017-06-12 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046740#comment-16046740
 ] 

Shixiong Zhu commented on SPARK-20927:
--

Do nothing except logging a warning

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046739#comment-16046739
 ] 

Dongjoon Hyun commented on SPARK-17642:
---

I see. Interesting. At first glance, the comments seem to be related to 
Databricks 3.0 (beta) instead of Apache Spark 2.2. Some features in 
`com.databricks.sql.acl.HiveAclExtensions` and 
`com.databricks.spark.sql.acl.client.SparkSqlAclClient`?

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21046) simplify the array offset and length in ColumnVector

2017-06-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21046.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18260
[https://github.com/apache/spark/pull/18260]

> simplify the array offset and length in ColumnVector
> 
>
> Key: SPARK-21046
> URL: https://issues.apache.org/jira/browse/SPARK-21046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Maria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046718#comment-16046718
 ] 

Maria commented on SPARK-17642:
---

[~dongjoon], thanks for such a quick response. 

The attached PR [1] introduces the 'DESC EXTENDED table-name column-name' SQL 
command to show column-level statistics: min/max, number of distinct values, 
histograms, etc. [~smilegator] raised a security concern: a user who doesn't 
have access to column C may still learn about the data in that column. 

"After rethinking about it, DESC EXTENDED/FORMATTED COLUMN discloses the data 
patterns/statistics info. These info are pretty sensitive. Not all the users 
should be allowed to access it.

We might face the security-related complaints about this feature.

...

Column-level security can block users to access the specific columns, but this 
command DESC EXTENDED/FORMATTED COLUMN might not be part of the 
design/solution."

[1] https://github.com/apache/spark/pull/16422

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21066) LibSVM load just one input file

2017-06-12 Thread darion yaphet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

darion yaphet updated SPARK-21066:
--
Description: 
Currently, when we use SVM to train on a dataset, we found that the input is 
limited to only one file.

A file stored on a distributed file system such as HDFS is split into multiple 
pieces, and I think this limit is not necessary.

 We can join input paths into a string separated by commas. 

  was:
Currently when we using SVM to train dataset we found the input files limit 
only one .

the source code as following :
{{{
 val path = if (dataFiles.length == 1) {
dataFiles.head.getPath.toUri.toString
 } else if (dataFiles.isEmpty) {
throw new IOException("No input path specified for libsvm data")
 } else {
throw new IOException("Multiple input paths are not supported for 
libsvm data.")
 }
}}}

The file store on the Distributed File System such as HDFS is split into mutil 
piece and I think this limit is not necessary . We can join input paths into a 
string split with comma. 


> LibSVM load just one input file
> ---
>
> Key: SPARK-21066
> URL: https://issues.apache.org/jira/browse/SPARK-21066
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: darion yaphet
>
> Currently, when we use SVM to train on a dataset, we found that the input is 
> limited to only one file.
> A file stored on a distributed file system such as HDFS is split into 
> multiple pieces, and I think this limit is not necessary.
>  We can join input paths into a string separated by commas. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21066) LibSVM load just one input file

2017-06-12 Thread darion yaphet (JIRA)
darion yaphet created SPARK-21066:
-

 Summary: LibSVM load just one input file
 Key: SPARK-21066
 URL: https://issues.apache.org/jira/browse/SPARK-21066
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.1
Reporter: darion yaphet


Currently, when we use SVM to train on a dataset, we found that the input is 
limited to only one file.

The source code is as follows:
{{{
 val path = if (dataFiles.length == 1) {
dataFiles.head.getPath.toUri.toString
 } else if (dataFiles.isEmpty) {
throw new IOException("No input path specified for libsvm data")
 } else {
throw new IOException("Multiple input paths are not supported for 
libsvm data.")
 }
}}}

A file stored on a distributed file system such as HDFS is split into multiple 
pieces, and I think this limit is not necessary. We can join input paths into a 
string separated by commas. 
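
For illustration, one way the relaxed check could look, reusing the `dataFiles` collection from the snippet above (a sketch only, not a proposed patch):

{code}
// Sketch: accept all matched part files instead of rejecting multiple input paths.
// `dataFiles`, `getPath` and IOException come from the snippet above; the rest is illustrative.
val path: String =
  if (dataFiles.isEmpty) {
    throw new IOException("No input path specified for libsvm data")
  } else {
    // Join every part file into a comma-separated path list, as suggested above.
    dataFiles.map(_.getPath.toUri.toString).mkString(",")
  }
{code}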



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046694#comment-16046694
 ] 

Dongjoon Hyun commented on SPARK-17642:
---

Hi, [~mbasmanova].
Spark-LLAP is a third party library. Is it related to this CBO issue in Apache 
Spark?

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-06-12 Thread Maria (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046664#comment-16046664
 ] 

Maria commented on SPARK-17642:
---

Folks,

It seems to me that column-level access control is implemented outside of Spark, 
in the spark-llap library. It also looks like making Spark SQL commands *secure* 
is the responsibility of the maintainers of LLAP. I'm making this assumption 
based on the following PRs, which have hardened the existing DESC TABLE command:

https://github.com/hortonworks-spark/spark-llap/pull/18
https://github.com/hortonworks-spark/spark-llap/pull/134

[~dongjoon], could you confirm? 

If that's the case, then wouldn't heads-up to [~dongjoon] be sufficient to 
unblock attached PR?

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command.
> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics.
> We should resolve this jira after column-level statistics are supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20715) MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and MapOutputTracker

2017-06-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-20715.

   Resolution: Fixed
Fix Version/s: 2.3.0

Fixed for 2.3.0.

> MapStatuses shouldn't be redundantly stored in both ShuffleMapStage and 
> MapOutputTracker
> 
>
> Key: SPARK-20715
> URL: https://issues.apache.org/jira/browse/SPARK-20715
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.3.0
>
>
> Today the MapOutputTracker and ShuffleMapStage both maintain their own copies 
> of MapStatuses. This creates the potential for bugs in case these two pieces 
> of state become out of sync.
> I believe that we can improve our ability to reason about the code by storing 
> this information only in the MapOutputTracker. This can also help to reduce 
> driver memory consumption.
> I will provide more details in my PR, where I'll walk through the detailed 
> arguments as to why we can take these two different metadata tracking formats 
> and consolidate without loss of performance or correctness.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict

2017-06-12 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046649#comment-16046649
 ] 

Dan Dutrow commented on SPARK-21065:


Something to note: If one batch's processing time exceeds the batch interval, 
then a second batch could begin before the first is complete. This is fine 
behavior for us, and offers the ability to catch up if something gets delayed, 
but may be confusing for the scheduler.

> Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
> --
>
> Key: SPARK-21065
> URL: https://issues.apache.org/jira/browse/SPARK-21065
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Scheduler
>Affects Versions: 2.1.0
>Reporter: Dan Dutrow
>
> My streaming application has 200+ output operations, many of them stateful 
> and several of them windowed. In an attempt to reduce the processing times, I 
> set "spark.streaming.concurrentJobs" to 2+. Initial results are very 
> positive, cutting our processing time from ~3 minutes to ~1 minute, but 
> eventually we encounter an exception as follows:
> Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
> batch from 45 minutes before the exception is thrown.
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
> org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
> StreamingJobProgressListener threw an exception
> java.util.NoSuchElementException: key not found 149697756 ms
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
> at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
> ...
> The Spark code causing the exception is here:
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125
>   override def onOutputOperationCompleted(
>   outputOperationCompleted: StreamingListenerOutputOperationCompleted): 
> Unit = synchronized {
> // This method is called before onBatchCompleted
> {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color}
>   updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
> }
> It seems to me that it may be caused by that batch being removed earlier.
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102
>   override def onBatchCompleted(batchCompleted: 
> StreamingListenerBatchCompleted): Unit = {
> synchronized {
>   waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime)
>   
> {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color}
>   val batchUIData = BatchUIData(batchCompleted.batchInfo)
>   completedBatchUIData.enqueue(batchUIData)
>   if (completedBatchUIData.size > batchUIDataLimit) {
> val removedBatch = completedBatchUIData.dequeue()
> batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime)
>   }
>   totalCompletedBatches += 1L
>   totalProcessedRecords += batchUIData.numRecords
> }
> }
> What is the solution here? Should I make my spark streaming context remember 
> duration a lot longer? ssc.remember(batchDuration * rememberMultiple)
> Otherwise, it seems like there should be some kind of existence check on 
> runningBatchUIData before dereferencing it.
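
For illustration, the existence check mentioned above could look roughly like this, assuming runningBatchUIData is the mutable map used in the quoted listener code (a sketch, not the actual fix):

{code}
override def onOutputOperationCompleted(
    outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = synchronized {
  val info = outputOperationCompleted.outputOperationInfo
  // Only update batches that are still tracked: with spark.streaming.concurrentJobs > 1,
  // a concurrent onBatchCompleted may already have removed this batch's UI data.
  runningBatchUIData.get(info.batchTime).foreach(_.updateOutputOperationInfo(info))
}
{code}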



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict

2017-06-12 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-21065:
---
Component/s: DStreams

> Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
> --
>
> Key: SPARK-21065
> URL: https://issues.apache.org/jira/browse/SPARK-21065
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Scheduler
>Affects Versions: 2.1.0
>Reporter: Dan Dutrow
>
> My streaming application has 200+ output operations, many of them stateful 
> and several of them windowed. In an attempt to reduce the processing times, I 
> set "spark.streaming.concurrentJobs" to 2+. Initial results are very 
> positive, cutting our processing time from ~3 minutes to ~1 minute, but 
> eventually we encounter an exception as follows:
> Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
> batch from 45 minutes before the exception is thrown.
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
> org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
> StreamingJobProgressListener threw an exception
> java.util.NoSuchElementException: key not found 149697756 ms
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
> at 
> org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
> at 
> org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
> ...
> The Spark code causing the exception is here:
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125
>   override def onOutputOperationCompleted(
>   outputOperationCompleted: StreamingListenerOutputOperationCompleted): 
> Unit = synchronized {
> // This method is called before onBatchCompleted
> {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color}
>   updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
> }
> It seems to me that it may be caused by that batch being removed earlier.
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102
>   override def onBatchCompleted(batchCompleted: 
> StreamingListenerBatchCompleted): Unit = {
> synchronized {
>   waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime)
>   
> {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color}
>   val batchUIData = BatchUIData(batchCompleted.batchInfo)
>   completedBatchUIData.enqueue(batchUIData)
>   if (completedBatchUIData.size > batchUIDataLimit) {
> val removedBatch = completedBatchUIData.dequeue()
> batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime)
>   }
>   totalCompletedBatches += 1L
>   totalProcessedRecords += batchUIData.numRecords
> }
> }
> What is the solution here? Should I make my Spark streaming context's remember 
> duration a lot longer, e.g. ssc.remember(batchDuration * rememberMultiple)?
> Otherwise, it seems like there should be some kind of existence check on 
> runningBatchUIData before dereferencing it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict

2017-06-12 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow updated SPARK-21065:
---
Description: 
My streaming application has 200+ output operations, many of them stateful and 
several of them windowed. In an attempt to reduce the processing times, I set 
"spark.streaming.concurrentJobs" to 2+. Initial results are very positive, 
cutting our processing time from ~3 minutes to ~1 minute, but eventually we 
encounter an exception as follows:

Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
batch from 45 minutes before the exception is thrown.

2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
StreamingJobProgressListener threw an exception
java.util.NoSuchElementException: key not found 149697756 ms
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
...


The Spark code causing the exception is here:

https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125
  override def onOutputOperationCompleted(
  outputOperationCompleted: StreamingListenerOutputOperationCompleted): 
Unit = synchronized {
// This method is called before onBatchCompleted
{color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color}
  updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
}

It seems to me that it may be caused by that batch being removed earlier.
https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102

  override def onBatchCompleted(batchCompleted: 
StreamingListenerBatchCompleted): Unit = {
synchronized {
  waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime)
  
{color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color}
  val batchUIData = BatchUIData(batchCompleted.batchInfo)
  completedBatchUIData.enqueue(batchUIData)
  if (completedBatchUIData.size > batchUIDataLimit) {
val removedBatch = completedBatchUIData.dequeue()
batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime)
  }
  totalCompletedBatches += 1L

  totalProcessedRecords += batchUIData.numRecords
}
}

What is the solution here? Should I make my Spark streaming context's remember 
duration a lot longer, e.g. ssc.remember(batchDuration * rememberMultiple)?

Otherwise, it seems like there should be some kind of existence check on 
runningBatchUIData before dereferencing it.


  was:
My streaming application has 200+ output operations, many of them stateful and 
several of them windowed. In an attempt to reduce the processing times, I set 
"spark.streaming.concurrentJobs" to 2+. Initial results are very positive, 
cutting our processing time from ~3 minutes to ~1 minute, but eventually we 
encounter an exception as follows:

Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
batch from 45 minutes before the exception is thrown.

2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
StreamingJobProgressListener threw an exception
java.util.NoSuchElementException: key not found 149697756 ms
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
...


The Spark code 

[jira] [Commented] (SPARK-21061) GMM Error : Matrix is not symmetric

2017-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046631#comment-16046631
 ] 

Sean Owen commented on SPARK-21061:
---

I get a slightly different error: NotConvergedException. The problem is that 
some inputs to eigSym are NaN. After tracing it back for a while, it's because 
math.exp(logpdf(x)) in pdf(x) becomes infinite, and that's likely because sigma 
is near-singular, so rootSigmaInv has huge values. I stopped before figuring 
out exactly what's wrong there, but that's pretty much the root cause.

For your example, I think it's because you have too many cluster centers for 
this data. Try k=5. It shouldn't cause an error, but I think you'll find you 
don't want nearly so many clusters anyway.
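
For illustration only, here is a minimal MLlib sketch of the suggestion above; the input path, the parsing, and the iteration count are placeholders rather than values from the original report, and a spark-shell session with sc already defined is assumed:

{code:scala}
import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Placeholder input: one comma-separated feature vector per line
val data = sc.textFile("data/gmm_input.csv")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .cache()

// Far fewer components; k = 5 is the value suggested above
val model = new GaussianMixture()
  .setK(5)
  .setMaxIterations(100)
  .run(data)

model.gaussians.zipWithIndex.foreach { case (g, i) =>
  println(s"component $i: weight=${model.weights(i)} mu=${g.mu}")
}
{code}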

> GMM Error : Matrix is not symmetric
> ---
>
> Key: SPARK-21061
> URL: https://issues.apache.org/jira/browse/SPARK-21061
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Ubuntu 14.04, Windows
>Reporter: Sourav Mahato
>
> I am trying to implement a Gaussian Mixture Model (GMM) using Apache Spark MLlib 
> (1.6.0), but I am getting the following error.
> https://stackoverflow.com/questions/44494599/spark-mllib-gmm-error-matrix-is-not-symmetric
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-12 Thread Peter Bykov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bykov updated SPARK-21063:

Affects Version/s: 2.1.0

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-12 Thread Peter Bykov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046596#comment-16046596
 ] 

Peter Bykov commented on SPARK-21063:
-

[~sowen] Data is available in the table; I also get an empty result if I execute:
{code:java}
val df = spark.read
  .option("url", "jdbc:hive2://remote.hive.local:1/default")
  .option("user", "user")
  .option("password", "pass")
  .option("dbtable", "(SELECT 1 AS col) AS dft")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .format("jdbc")
  .load()
{code}
As you can see in my description, I receive the table schema but no data, which 
means the connection is OK.
I'm pointing at the right cluster; I checked the link through a simple JDBC connection.
Also, if I execute a query like: 
{code:java}
val spark = SparkSession.builder
  .appName("RemoteSparkTest")
  .master("local")
  .config("hive.metastore.uris", "thrift://remote.hive.local:9083/default")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT * FROM test_table")
df.show()
{code}

Result:
{noformat}
++
|test_col|
++
|test row|
++
{noformat}

So, the data is available.
 

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict

2017-06-12 Thread Dan Dutrow (JIRA)
Dan Dutrow created SPARK-21065:
--

 Summary: Spark Streaming concurrentJobs + 
StreamingJobProgressListener conflict
 Key: SPARK-21065
 URL: https://issues.apache.org/jira/browse/SPARK-21065
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Dan Dutrow


My streaming application has 200+ output operations, many of them stateful and 
several of them windowed. In an attempt to reduce the processing times, I set 
"spark.streaming.concurrentJobs" to 2+. Initial results are very positive, 
cutting our processing time from ~3 minutes to ~1 minute, but eventually we 
encounter an exception as follows:

Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a 
batch from 45 minutes before the exception is thrown.

2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR 
org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener 
StreamingJobProgressListener threw an exception
java.util.NoSuchElementException: key not found 149697756 ms
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
at 
org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29)
at 
org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43)
...


The Spark code causing the exception is here:

https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125
  override def onOutputOperationCompleted(
  outputOperationCompleted: StreamingListenerOutputOperationCompleted): 
Unit = synchronized {
// This method is called before onBatchCompleted
{color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color}
  updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
}

It seems to me that it may be caused by that batch being removed earlier.
https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102

  override def onBatchCompleted(batchCompleted: 
StreamingListenerBatchCompleted): Unit = {
synchronized {
  waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime)
  
{color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color}
  val batchUIData = BatchUIData(batchCompleted.batchInfo)
  completedBatchUIData.enqueue(batchUIData)
  if (completedBatchUIData.size > batchUIDataLimit) {
val removedBatch = completedBatchUIData.dequeue()
batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime)
  }
  totalCompletedBatches += 1L

  totalProcessedRecords += batchUIData.numRecords
}
}

What is the solution here? Should I make my Spark streaming context's remember 
duration a lot longer, e.g. ssc.remember(batchDuration * rememberDuration)?

Otherwise, it seems like there should be some kind of existence check on 
runningBatchUIData before dereferencing it.
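
For illustration, a minimal sketch of the kind of existence check suggested above; this is an assumption about one possible guard, not the actual Spark fix, and it reuses the names from the snippets quoted earlier:

{code:scala}
override def onOutputOperationCompleted(
    outputOperationCompleted: StreamingListenerOutputOperationCompleted): Unit = synchronized {
  val batchTime = outputOperationCompleted.outputOperationInfo.batchTime
  // Only touch the entry if the batch is still tracked; if onBatchCompleted has
  // already removed it from runningBatchUIData, skip the update instead of throwing.
  runningBatchUIData.get(batchTime).foreach { batchUIData =>
    batchUIData.updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo)
  }
}
{code}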




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046581#comment-16046581
 ] 

Sean Owen commented on SPARK-21063:
---

Is there data in the table? Are you pointing at the right cluster? Are there 
other errors? Etc.

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-12 Thread Peter Bykov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046576#comment-16046576
 ] 

Peter Bykov commented on SPARK-21063:
-

[~srowen] What do you mean by a wrong config? What configuration should I check?

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21041) With whole-stage codegen, SparkSession.range()'s behavior is inconsistent with SparkContext.range()

2017-06-12 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21041.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0
   2.2.1

> With whole-stage codegen, SparkSession.range()'s behavior is inconsistent 
> with SparkContext.range()
> ---
>
> Key: SPARK-21041
> URL: https://issues.apache.org/jira/browse/SPARK-21041
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Kris Mok
>Assignee: Dongjoon Hyun
> Fix For: 2.2.1, 2.3.0
>
>
> When whole-stage codegen is enabled, in the face of integer overflow, 
> SparkSession.range()'s behavior is inconsistent with when codegen is turned 
> off, while the latter is consistent with SparkContext.range()'s behavior.
> The following Spark Shell session shows the inconsistency:
> {code:java}
> scala> sc.range
>def range(start: Long,end: Long,step: Long,numSlices: Int): 
> org.apache.spark.rdd.RDD[Long]
> scala> spark.range
>   
>
> def range(start: Long,end: Long,step: Long,numPartitions: Int): 
> org.apache.spark.sql.Dataset[Long]   
> def range(start: Long,end: Long,step: Long): 
> org.apache.spark.sql.Dataset[Long]  
> def range(start: Long,end: Long): org.apache.spark.sql.Dataset[Long]  
>
> def range(end: Long): org.apache.spark.sql.Dataset[Long] 
> scala> sc.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 2, 
> 1).collect
> res1: Array[Long] = Array()
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 
> 2, 1).collect
> res2: Array[Long] = Array(9223372036854775804, 9223372036854775805, 
> 9223372036854775806)
> scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
> scala> spark.range(java.lang.Long.MAX_VALUE - 3, java.lang.Long.MIN_VALUE + 
> 2, 1).collect
> res5: Array[Long] = Array()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046493#comment-16046493
 ] 

Apache Spark commented on SPARK-21064:
--

User 'djvulee' has created a pull request for this issue:
https://github.com/apache/spark/pull/18280

> Fix the default value bug in NettyBlockTransferServiceSuite
> ---
>
> Key: SPARK-21064
> URL: https://issues.apache.org/jira/browse/SPARK-21064
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>Assignee: DjvuLee
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21019) read orc when some of the columns are missing in some files

2017-06-12 Thread Mahesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046485#comment-16046485
 ] 

Mahesh commented on SPARK-21019:


Can you include the exception in your defect? From what I know, Spark uses the 
ORC implementation hosted at http://orc.apache.org/
You may want to try a similar thing with that library. I will try to run your 
code.

>  read orc when some of the columns are missing in some files
> 
>
> Key: SPARK-21019
> URL: https://issues.apache.org/jira/browse/SPARK-21019
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Modi Tamam
>
> I'm using Spark 2.1.1.
> I'm experiencing an issue when reading a bunch of ORC files where some of 
> the fields are missing from some of the files (file-1 has fields 'a' and 'b', 
> file-2 has fields 'a' and 'c'). When I run the same flow with the JSON 
> file format, everything is just fine (you can see it in the code snippets, 
> if you run them...). My question is whether this is a bug or expected 
> behavior.
> I've pushed a Maven project, ready to run; you can find it here: 
> https://github.com/MordechaiTamam/spark-orc-issue



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21064:
-

   Assignee: DjvuLee
   Priority: Trivial  (was: Major)
Component/s: (was: Spark Core)
 Tests

> Fix the default value bug in NettyBlockTransferServiceSuite
> ---
>
> Key: SPARK-21064
> URL: https://issues.apache.org/jira/browse/SPARK-21064
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>Assignee: DjvuLee
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046478#comment-16046478
 ] 

Apache Spark commented on SPARK-21064:
--

User 'djvulee' has created a pull request for this issue:
https://github.com/apache/spark/pull/18279

> Fix the default value bug in NettyBlockTransferServiceSuite
> ---
>
> Key: SPARK-21064
> URL: https://issues.apache.org/jira/browse/SPARK-21064
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21064:


Assignee: (was: Apache Spark)

> Fix the default value bug in NettyBlockTransferServiceSuite
> ---
>
> Key: SPARK-21064
> URL: https://issues.apache.org/jira/browse/SPARK-21064
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21064:


Assignee: Apache Spark

> Fix the default value bug in NettyBlockTransferServiceSuite
> ---
>
> Key: SPARK-21064
> URL: https://issues.apache.org/jira/browse/SPARK-21064
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20927) Add cache operator to Unsupported Operations in Structured Streaming

2017-06-12 Thread Chenzhao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046476#comment-16046476
 ] 

Chenzhao Guo commented on SPARK-20927:
--

What exactly is a 'no-op'? Does that mean a Scala `NotImplementedError`, or could 
you give an example?

> Add cache operator to Unsupported Operations in Structured Streaming 
> -
>
> Key: SPARK-20927
> URL: https://issues.apache.org/jira/browse/SPARK-20927
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> Just [found 
> out|https://stackoverflow.com/questions/42062092/why-does-using-cache-on-streaming-datasets-fail-with-analysisexception-queries]
>  that {{cache}} is not allowed on streaming datasets.
> {{cache}} on streaming datasets leads to the following exception:
> {code}
> scala> spark.readStream.text("files").cache
> org.apache.spark.sql.AnalysisException: Queries with streaming sources must 
> be executed with writeStream.start();;
> FileSource[files]
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:34)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:63)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:104)
>   at 
> org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:68)
>   at 
> org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:92)
>   at org.apache.spark.sql.Dataset.persist(Dataset.scala:2603)
>   at org.apache.spark.sql.Dataset.cache(Dataset.scala:2613)
>   ... 48 elided
> {code}
> It should be included in Structured Streaming's [Unsupported 
> Operations|http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046475#comment-16046475
 ] 

DjvuLee commented on SPARK-21064:
-

The default value for `spark.port.maxRetries` is 100, but we use 10
in the suite file.

> Fix the default value bug in NettyBlockTransferServiceSuite
> ---
>
> Key: SPARK-21064
> URL: https://issues.apache.org/jira/browse/SPARK-21064
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DjvuLee
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21064) Fix the default value bug in NettyBlockTransferServiceSuite

2017-06-12 Thread DjvuLee (JIRA)
DjvuLee created SPARK-21064:
---

 Summary: Fix the default value bug in 
NettyBlockTransferServiceSuite
 Key: SPARK-21064
 URL: https://issues.apache.org/jira/browse/SPARK-21064
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: DjvuLee






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21058) potential SVD optimization

2017-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046424#comment-16046424
 ] 

Sean Owen commented on SPARK-21058:
---

I think we discussed this separately. The Gramian method is indeed appropriate 
for tall, skinny matrices. Just because computing the Gramian is the hotspot 
doesn't mean there's a faster way. How would you efficiently compute the SVD of 
any huge matrix?

If the matrix isn't huge, you can pull it into memory and decompose it with a 
library, sure. But that's not a Spark problem.

There's also a different SVD implementation in GraphX.

No, I would not start work on anything until you've addressed the above. 

> potential SVD optimization
> --
>
> Key: SPARK-21058
> URL: https://issues.apache.org/jira/browse/SPARK-21058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.1
>Reporter: Vincent
>
> In the current implementation, computeSVD computes the SVD of a matrix A by 
> first computing A^T * A and then taking the SVD of that Gramian matrix. We found 
> that the Gramian matrix computation is the hot spot of the overall SVD 
> computation. While SVD on the Gramian matrix helps for a skinny matrix, for a 
> non-skinny matrix it can also become a huge overhead. So, is it possible 
> to offer another option that computes the SVD on the original matrix instead of 
> the Gramian matrix? We could decide which way to go by the ratio between 
> numCols and numRows, or simply by a setting from the user.
> We have observed a handsome gain on a toy dataset by taking the SVD of the 
> original matrix instead of the Gramian matrix; if the proposal is acceptable, 
> we will start to work on the patch and gather more performance data.
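
For context, the standard identity behind the Gramian route (plain linear algebra, not Spark-specific): for an m x n matrix A with SVD A = U \Sigma V^T,

{noformat}
A^T A = V \Sigma^2 V^T      (an n x n eigendecomposition yields V and \Sigma)
U = A V \Sigma^{-1}         (recovered afterwards with one matrix product)
{noformat}

so only an n x n matrix is ever decomposed, which is why the approach pays off when numCols is much smaller than numRows and can become overhead otherwise.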



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20947:


Assignee: Apache Spark

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
>Assignee: Apache Spark
> Attachments: fix-pipe-encoding-error.patch
>
>
> The pipe action converts objects into strings in a way that is affected by the 
> default encoding setting of the Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly converts `obj` to a unicode string, then 
> encodes it into a byte string using the default encoding; on the other hand, the 
> `s.encode('utf-8')` part implicitly decodes `s` into a unicode string using 
> the default encoding and then encodes it (again!) into a UTF-8 encoded byte string.
> Typically the default encoding of the Python environment is 'ascii', which 
> means passing a unicode string containing characters beyond the 'ascii' charset 
> will raise a UnicodeEncodeError exception at `str(obj)`, and passing a byte 
> string containing bytes greater than 128 will raise a UnicodeDecodeError 
> exception at `s.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046420#comment-16046420
 ] 

Apache Spark commented on SPARK-20947:
--

User 'chaoslawful' has created a pull request for this issue:
https://github.com/apache/spark/pull/18277

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
> Attachments: fix-pipe-encoding-error.patch
>
>
> The pipe action converts objects into strings in a way that is affected by the 
> default encoding setting of the Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly converts `obj` to a unicode string, then 
> encodes it into a byte string using the default encoding; on the other hand, the 
> `s.encode('utf-8')` part implicitly decodes `s` into a unicode string using 
> the default encoding and then encodes it (again!) into a UTF-8 encoded byte string.
> Typically the default encoding of the Python environment is 'ascii', which 
> means passing a unicode string containing characters beyond the 'ascii' charset 
> will raise a UnicodeEncodeError exception at `str(obj)`, and passing a byte 
> string containing bytes greater than 128 will raise a UnicodeDecodeError 
> exception at `s.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20947) Encoding/decoding issue in PySpark pipe implementation

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20947:


Assignee: (was: Apache Spark)

> Encoding/decoding issue in PySpark pipe implementation
> --
>
> Key: SPARK-20947
> URL: https://issues.apache.org/jira/browse/SPARK-20947
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 
> 2.1.1
>Reporter: Xiaozhe Wang
> Attachments: fix-pipe-encoding-error.patch
>
>
> The pipe action converts objects into strings in a way that is affected by the 
> default encoding setting of the Python environment.
> Here is the related code fragment (L717-721@python/pyspark/rdd.py):
> {code}
> def pipe_objs(out):
> for obj in iterator:
> s = str(obj).rstrip('\n') + '\n'
> out.write(s.encode('utf-8'))
> out.close()
> {code}
> The `str(obj)` part implicitly converts `obj` to a unicode string, then 
> encodes it into a byte string using the default encoding; on the other hand, the 
> `s.encode('utf-8')` part implicitly decodes `s` into a unicode string using 
> the default encoding and then encodes it (again!) into a UTF-8 encoded byte string.
> Typically the default encoding of the Python environment is 'ascii', which 
> means passing a unicode string containing characters beyond the 'ascii' charset 
> will raise a UnicodeEncodeError exception at `str(obj)`, and passing a byte 
> string containing bytes greater than 128 will raise a UnicodeDecodeError 
> exception at `s.encode('utf-8')`.
> Changing `str(obj)` to `unicode(obj)` would eliminate these problems.
> The following code snippet reproduces these errors:
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 1.6.3
>   /_/
> Using Python version 2.7.12 (default, Jul 25 2016 15:06:45)
> SparkContext available as sc, HiveContext available as sqlContext.
> >>> sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
> [Stage 0:>  (0 + 4) / 
> 4]Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 719, in pipe_objs
> s = str(obj).rstrip('\n') + '\n'
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: 
> ordinal not in range(128)
> >>>
> >>> sc.parallelize([u'\u6d4b\u8bd5']).map(lambda x: 
> >>> x.encode('utf-8')).pipe('cat').collect()
> Exception in thread Thread-1:
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 801, in __bootstrap_inner
> self.run()
>   File 
> "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py",
>  line 754, in run
> self.__target(*self.__args, **self.__kwargs)
>   File 
> "/Users/wxz/Downloads/spark-1.6.3-bin-hadoop2.6/python/pyspark/rdd.py", line 
> 720, in pipe_objs
> out.write(s.encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: 
> ordinal not in range(128)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21057:


Assignee: (was: Apache Spark)

> Do not use a PascalDistribution in countApprox
> --
>
> Key: SPARK-21057
> URL: https://issues.apache.org/jira/browse/SPARK-21057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lovasoa
>
> I was reading the source of Spark, and found this:
> https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72
> This is the function that estimates the probability distribution of the total 
> count of elements in an RDD given the count of only some partitions.
> This function does a strange thing: when the number of elements counted so 
> far is less than 10,000, it models the total count with a negative binomial 
> (Pascal) law; otherwise, it models it with a Poisson law.
> Modeling our number of uncounted elements with a negative binomial law is 
> like saying that we ran over elements, counting only some, and stopping after 
> having counted a given number of elements.
> But this does not model what really happened.  Our counting was limited in 
> time, not in number of counted elements, and we can't count only some of the 
> elements in a partition.
> I propose to use the Poisson distribution in every case, as it can be 
> justified under the hypothesis that the number of elements in each partition 
> is independent and follows a Poisson law.
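
For illustration only, a minimal sketch of the sort of Poisson-based interval the description argues for; poissonCountInterval is a hypothetical helper, not the actual CountEvaluator code, and commons-math3 is assumed to be on the classpath:

{code:scala}
import org.apache.commons.math3.distribution.PoissonDistribution

// counted: elements seen in the sampled partitions; p: fraction of partitions counted (0 < p <= 1)
// Models the total count as Poisson with mean counted / p and returns a (low, high) interval.
def poissonCountInterval(counted: Long, p: Double, confidence: Double): (Double, Double) = {
  val mean = counted / p                     // point estimate of the total count; must be > 0
  val dist = new PoissonDistribution(mean)
  val low = dist.inverseCumulativeProbability((1.0 - confidence) / 2).toDouble
  val high = dist.inverseCumulativeProbability((1.0 + confidence) / 2).toDouble
  (low, high)
}

// Example: 4000 elements counted after seeing 40% of the partitions, at 95% confidence
// poissonCountInterval(4000L, 0.4, 0.95)
{code}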



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21057:


Assignee: Apache Spark

> Do not use a PascalDistribution in countApprox
> --
>
> Key: SPARK-21057
> URL: https://issues.apache.org/jira/browse/SPARK-21057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lovasoa
>Assignee: Apache Spark
>
> I was reading the source of Spark, and found this:
> https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72
> This is the function that estimates the probability distribution of the total 
> count of elements in an RDD given the count of only some partitions.
> This function does a strange thing: when the number of elements counted so 
> far is less than 10,000, it models the total count with a negative binomial 
> (Pascal) law; otherwise, it models it with a Poisson law.
> Modeling our number of uncounted elements with a negative binomial law is 
> like saying that we ran over elements, counting only some, and stopping after 
> having counted a given number of elements.
> But this does not model what really happened.  Our counting was limited in 
> time, not in number of counted elements, and we can't count only some of the 
> elements in a partition.
> I propose to use the Poisson distribution in every case, as it can be 
> justified under the hypothesis that the number of elements in each partition 
> is independent and follows a Poisson law.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21057) Do not use a PascalDistribution in countApprox

2017-06-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046402#comment-16046402
 ] 

Apache Spark commented on SPARK-21057:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18276

> Do not use a PascalDistribution in countApprox
> --
>
> Key: SPARK-21057
> URL: https://issues.apache.org/jira/browse/SPARK-21057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lovasoa
>
> I was reading the source of Spark, and found this:
> https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72
> This is the function that estimates the probability distribution of the total 
> count of elements in an RDD given the count of only some partitions.
> This function does a strange thing: when the number of elements counted so 
> far is less than 10,000, it models the total count with a negative binomial 
> (Pascal) law; otherwise, it models it with a Poisson law.
> Modeling our number of uncounted elements with a negative binomial law is 
> like saying that we ran over elements, counting only some, and stopping after 
> having counted a given number of elements.
> But this does not model what really happened.  Our counting was limited in 
> time, not in number of counted elements, and we can't count only some of the 
> elements in a partition.
> I propose to use the Poisson distribution in every case, as it can be 
> justified under the hypothesis that the number of elements in each partition 
> is independent and follows a Poisson law.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046395#comment-16046395
 ] 

Sean Owen commented on SPARK-21063:
---

Nothing about this suggests a Spark problem. You may not have the data you 
think; you may have the wrong config. This works fine on my cluster and my data.

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


